REMARKS

These remarks are in response to the Office Action mailed September 11, 
2006. Claims 1-95 are pending in the application. Claims 1-48, 50, 65-78, 80 and 
89-95 have been withdrawn as directed to a non-elected invention. Claims 79-88 
have been joined with claims 49-64 in response to the Restriction Requirement. 
Accordingly, claims 49, 51-64, 79 and 81-88 are currently under Examination. 

The specification has been amended to correct a scrivener's error in drafting. 
One of skill in the art will recognize, for example, that dATP refers to a 
deoxyadenosine nucleotide triphosphate. Claims 94-95 have been canceled without 
prejudice to Applicants' right to prosecute the canceled subject matter in any 
continuation, continuation-in-part, divisional or other application. Claims 49, 51-52, 
56-57, 61, 64, 79, 81 and 84 have been amended. Claims 96 and 97 have been 
added. Support for the amendments and new claims can be found throughout the 
specification as filed. For example, the amendments to claims 56 and 57 are 
supported at page 8, paragraph 27. The specification further provides results 
demonstrating that an "increase" in cDNA levels relative to the control is indicative 
of a subject that is a candidate for cancer management (see, e.g., Figure 2). No new 
matter is believed to have been introduced. 

Applicants respectfully thank Examiner Schlapkohl and Examiner Guzo for the 
courteous telephonic interview conducted with Applicants' representative, Joseph 
Baker, and licensee's representative Dr. Les Overman, on December 11, 2006. The 
parties discussed the pending rejections and proposed claim amendments. No 
agreement was reached. 

I. OATH/DECLARATION 

A substitute Declaration accompanies the present response. 

II. PRIORITY 

The Office Action indicates that Applicants' claim for the benefit of priority is 
acknowledged, but the priority document (60/488,660) allegedly lacks polynucleotide 
sequences of any of the claimed polynucleotides or primers recited in the instant 
claims. Applicants respectfully traverse. 

The Examiner is respectfully reminded that neither examples nor DNA 
sequences are required to provide an adequate written description to support a claim 
if references contemporaneous with the filing date showed relevant genes and 
nucleotide sequences to demonstrate knowledge to those skilled in the art. Falkner 
v. Inglis, 448 F.3d 1357, 79 USPQ2d 1001 (Fed. Cir. 2006). The position in Falkner 
is consistent with the long held position that a patent need not teach, and preferably 
omits, what is well known in the art. In re Buchner, 929 F.2d 660, 661, 18 USPQ2d 
1331, 1332 (Fed. Cir. 1991); Spectra-Physics, Inc. v. Coherent, Inc., 827 F.2d 1524, 
3 USPQ2d 1737 (Fed. Cir. 1987); Hybritech Inc. v. Monoclonal Antibodies, Inc., 802 
F.2d 1367, 1384, 231 USPQ 81, 94 (Fed. Cir. 1986), cert. denied, 480 U.S. 947 
(1987); and Lindemann Maschinenfabrik GMBH v. American Hoist & Derrick Co., 
730 F.2d 1452, 1463, 221 USPQ 481, 489 (Fed. Cir. 1984). The gene sequences 
recited in the present application were known and accessible in databases at the 
time of the provisional filing. Accordingly, Applicants submit that the provisional 
priority document supports the presently claimed invention. 

III. CLAIM OBJECTIONS 

Claims 49, 51-64, 79 and 81-88 stand objected to because the claims 
comprise non-elected subject matter. Claims 49, 51-52, 79, 81 and 84 have been 
amended to remove recitation of the non-elected subject matter. Accordingly, the 
objections may be properly withdrawn. 

IV. REJECTION UNDER 35 U.S.C. §112, SECOND PARAGRAPH 

Claims 61, 62 and 85 stand rejected under 35 U.S.C. §112, second paragraph 
as allegedly being indefinite for failing to particularly point out and distinctly claim the 
subject matter which applicant regards as the invention. Applicants respectfully 
traverse this rejection. 

The Office Action alleges that the term "minimally invasive" in claim 61 is a 
relative term which renders the claim indefinite (see, Office Action at page 5, lines 
11-18). Applicants have amended the claims to recite "non-invasive" in addition to 
"minimally invasive". "Non-invasive" is supported in the specification as filed. Both 
"minimally invasive" and "non-invasive" are terms commonly used in the art. For 
example, "minimally invasive" is defined as a medical procedure that is carried out by 
entering the body through the skin or through a body cavity or anatomical opening, 
but with the smallest damage possible to these structures (see, e.g., 
http://en.wikipedia.org/wiki/Minimally_invasive). This definition is consistent with the 
use of swabbing to collect colon and rectal samples as described in the specification. 
Non-invasive is also described at http://en.wikipedia.org/wiki/Non-invasive_(medical) 
as a medical procedure which does not penetrate or break the skin or a body cavity, 
i.e., it doesn't require an (invasive) incision into the body or the removal of biological 
tissue. The term non-invasive is consistent with the use of stool sample collection as 
described in the specification. 

The Office Action also alleges that the recitation of "reagents for the 
preparation of cDNA" in claim 85 is indefinite (see, Office Action at page 4, line 19 to 
page 5, line 3). In particular, the Office Action alleges that it is unclear whether the 
recited primers are intended for use in the analysis of polynucleotides but not as 
reagents for the preparation of cDNA. Applicants respectfully submit that by the 
doctrine of claim differentiation, the reagents recited in claim 85 are primers other 
than the primers identified in claim 84. The specification, for example, indicates at 
the paragraph beginning on page 12, paragraph 38, that the other reagents may 
include "primers, enzymes, and other reagents" for the preparation, detection and 
quantitation of cDNA. 

Thus, Applicants submit that the terms used in claims 61, 62 and 85 are not 
indefinite as the terms are recognized by one of skill in the art as set forth, for 
example, on the World Wide Web (claims 61 and 62) and by the doctrine of claim 
differentiation (claim 85). Accordingly, Applicants request withdrawal of the §112, 
second paragraph rejection. 

V. REJECTION UNDER 35 U.S.C. §112, FIRST PARAGRAPH (Written 
Description) 

Claims 49, 51-64, 79 and 81-88 stand rejected under 35 U.S.C. §112, first 
paragraph, as allegedly failing to comply with the written description requirement. 
The claims allegedly contain subject matter which was not described in the 
specification in such a way as to reasonably convey to one skilled in the relevant art 
that the inventor(s), at the time the application was filed, had possession of the 
claimed invention. The Office Action alleges that the "controls" used in the claimed 
invention lack any structural information (see, Office action at page 7, lines 8-15). 
Applicants respectfully traverse this rejection. 

Claim 56 has been amended to reflect that the control is independently 
validated as a "normal" control. One of skill in the art can readily identify appropriate 
controls. Control samples are routinely used in diagnostics including, for example, in 
criminal investigation related to the identification of SNPs. Applicants have amended 
claim 57 to reflect that at least one cDNA is increased in the sample relative to the 
normalized control, the increase being indicative of a patient that should be managed 
for colorectal cancer. 

In addition to the above, the Office Action alleges that the "claims do not 
provide any structural information with regard to the biological samples and controls 
which can be used such that patient care management... is achieved." (Id.) The 
Office Action further alleges that "the specification does not teach which samples 
should be compared to which controls such that patient care is managed" etc. (see, 
Office Action at page 8, line 22 to page 9, line 2). Applicants respectfully submit that 
"controls" are routinely used in the art of nucleic acid analysis. For example, the 
Examiner will recognize that various housekeeping genes are routinely used for 
quantitation of expression of a gene to be measured. Such basic scientific 
procedures of control comparisons are routinely used in the art. 

The Office Action further alleges that the results with colorectal cells are not 
necessarily predictive of any other biological sample or control and that the prior art 
does not describe a set of biological samples and controls that can be used such 
that expression of the claimed polynucleotides "dictate how patient care of patients 
with CRC or colorectal polyps should be managed." (see, Office Action at page 9, 
lines 3-21). In support of this position the Office relies upon a post-filing reference, 
Barrier et al. (see, Office Action at page 9, line 21 to page 10, line 11). Applicants 
submit that Barrier et al. analyzes different genes and allegedly describes gene 
expression measurements from tumor and adjacent non-neoplastic colon tissue 



Attorney's Docket No. 1034516-000006 
Application No. 10/690,880 
Page 19 

samples as a prognostic predictor model for stage II & III colon cancer. The Office 
Action alleges that the Barrier et al. reference indicates that more study is needed to 
arrive at a predictive model. Applicants respectfully submit that the Barrier et al. 
reference is not relevant to the predictive capabilities of Applicants' claimed 
invention. Applicants' claimed invention utilizes different genes and thus provides a 
different panel of markers not analyzed by Barrier et al. Merely because a reference 
using different "factors" arrives at a different conclusion is not indicative that the 
claimed invention lacks support for the claimed subject matter. Furthermore, 
Applicants submit that Barrier et al. has since published another manuscript which 
indicates that gene profiling is useful as a predictor of stage II colon cancer (see, 
Appendix A, attached hereto; Barrier et al., J. Clin. Oncol. 24(29):4685-91, Oct. 2006 
at "Conclusion," page 4685; see also, Ancona et al., BMC Bioinformatics, 19(7):387, 
Aug. 2006). 

For at least the foregoing reasons, Applicants submit that the claimed 
invention was in Applicants' possession at the time of filing. Accordingly, Applicants 
respectfully request withdrawal of this rejection under §112, first paragraph. 

VI. REJECTION UNDER 35 U.S.C. §112, FIRST PARAGRAPH (Enablement) 

Claims 49, 51-64, 79 and 81-88 stand rejected under 35 U.S.C. §112, first 
paragraph, as allegedly failing to comply with the enablement requirement. The 
claims allegedly contain subject matter which was not described in the specification 
in such a way as to enable one skilled in the art to which it pertains, or with which it 
is most nearly connected, to make and/or use the invention. Applicants respectfully 
traverse this rejection. 

The Office Action sets forth this rejection, in part, by reference to the Wands 
factors. 

Nature of the Invention 

The Office Action alleges that "[t]he invention is complex in that it involves 
measuring a change in the level of RNA by amplification, such that either patient 
care can be managed or such that upon comparison with normal controls, the 
method can be used for discovery of therapeutic interventions." (see, Office Action at 
page 12, lines 17-21). Applicants submit that gene expression profiling methods for 
patient care are common in the art. 

Breadth of Claims 

The Office Action alleges that "[t]he claims are extremely broad in that they 
encompass methods for measuring the expression levels of polynucleotides from 
any biological samples and comparing such expression levels to any control such 
that the comparison is used in any aspect of the management of patient care..." 
(see, e.g., Office Action at page 13, lines 4-8). Applicants refer the Examiner to the 
amendments above and Applicants' remarks as they relate to the rejections 
discussed above. 

Guidance of the specification / The existence of working examples: 

The Office Action alleges that the specification fails to teach what a difference 
in expression means for patient care management or for the discovery of therapeutic 
interventions, how RNA expression measurements can be used to manage patient 
care or to discover new therapeutic interventions, how differences in expression of 
claimed polynucleotides can be used for risk assessment, early diagnosis, 
establishing a prognosis, monitoring patient treatment or detecting relapse, and that 
the specification allegedly teaches mRNA levels are not good predictors of protein 
expression and that to understand the expression level of proteins, and their 
complete structure, direct analysis of proteins is required (see, Office Action at 
page 14, line 21 to page 15, line 19). Applicants submit that the specification 
teaches that AFP and CEA biomarkers have been used for over four decades and 
that biomarkers have five potential uses in the management of patient care: risk 
assessment, early diagnosis, establishing prognosis, monitoring treatment and 
detecting relapse. "Additionally, such markers could play a valuable role in 
developing therapeutic interventions." (See, e.g., page 4, paragraph 12). 
Furthermore, the specification states that "[V]alues for gene expression profiling for 
patient vs. normal control may vary either up, as in the case of IL 8, or down, as in 
the case of PPAR-γ. 
It is the determination of the collective shift for the patient vs. normal control that is 
significant when using a panel of biomarkers." (See, e.g., page 8, paragraph 27). 

Figure 2a teaches that 6 biomarker genes were examined in mouse MIN 
model colon polyps, five of which showed increased expression (SDF-1, COX2, 
CXCR2, OPN, MCSF1) and one showed decreased expression (PPAR-γ) relative to 
wild-type littermate. Figure 2b teaches that 6 correlative human biomarker genes 
show similar expression differences between normal biopsy specimens and biopsy 
specimens from normal-appearing mucosa (either sigmoid and rectum or ascending 
colon) from colon cancer patients. A MANOVA analysis of a panel of 9 biomarkers 
shown in Figure 2c between 78 sigmoidal-rectal biopsies from 12 normal patients 
and 63 from non-cancerous sections of 6 patients with sigmoid rectal carcinoma 
demonstrates a significant difference in the combined expression of the biomarkers 
between the normal patient biopsies and the biopsies of non-cancerous sections 
from patients with sigmoid-rectal carcinoma. (See, e.g., page 9, paragraph 28 and 
Figure 2c). Applicants submit that armed with the teachings in the specification 
regarding the changes in gene expression profile between CRC patients and 
validated normal controls and the long history of using biomarkers for the 
management of patient care, the skilled artisan would know how to use the claimed 
invention. 

The Examiner is respectfully reminded that it is sufficient if the disclosure 
teaches those skilled in the art what the invention is and how to practice it. In re 
Grimme, Keil and Schmitz, 124 USPQ 449, 502 (CCPA 1960). A disclosure of 
every operable species is not required. One method is sufficient. It is not necessary 
that a patent applicant test all the embodiments of an invention. Amgen Inc. v. 
Chugai Pharmaceutical Co. Ltd., 927 F.2d 1200, 18 USPQ2d 1016 (Fed. Cir. 1991) 
cert. denied 502 U.S. 856 (1991); In re Angstadt, 190 USPQ 214, 218 (CCPA); 
MPEP §2164.03. As long as the specification discloses at least one method for 
making and using the claimed invention that bears a reasonable correlation to the 
entire scope of the claim, then the enablement requirement of Section 112 is 
satisfied. In re Fisher, 427 F.2d 833, 839, 166 USPQ 18, 24 (CCPA 1970). The 
presence of only one working example should never be the sole reason for making a 
scope rejection. Training Materials for Examining Patent Applications with Respect to 
35 U.S.C. Section 112, first paragraph - Enablement Chemical/Biotechnical 
Applications. 

State of the Prior Art 

The Office Action recites from the specification that the "discovery of panels 
useful in providing value in patient care management for CRC is in the nascent 
stage." (Office Action at page 16, lines 1-4). The Office Action also cites to post-filing 
art, Barrier et al. and Hao et al., for the proposition that the state of the art is 
immature with respect to the use of gene expression profiling for diagnosis and 
disease management generally, and for management of patient care and discovery 
of therapeutic interventions for CRC and colorectal polyps in particular. Applicants 
submit that Barrier et al. has subsequently published (see, Appendix A) indicating 
that stage II cancers are predictable using genetic markers using techniques that are 
similar to those the Office Action indicates demonstrate the immaturity of the art. 

Predictability of the Art / Amount of Experimentation Necessary 

The Office Action alleges that the field is unpredictable and requires undue 
experimentation. In particular, the Office Action cites from Wu, Lucitini and Chen et 
al. for the proposition that correlating gene expression level to any phenotypic quality 
is unpredictable and "may, in part, be due to the fact that increased mRNA is not 
always indicative of protein expression levels, as indicated in the specification". 
(Office Action at page 18, lines 12-14). The Office Action further alleges that a large 
and prohibitive amount of experimentation would be required to make and use the 
claimed invention in order to establish expression differences were statistically 
significant. (Office Action at page 19, lines 4-6). Applicants submit that the 
specification teaches correlating differences of gene expression level of normal- 
appearing mucosa from colon cancer patients to validated normal controls, thus 
demonstrating that such a method in fact works. Furthermore, Applicants have 
amended claim 57 to reflect that the difference comprises an increase in at least one 
cDNA level in the sample relative to the control. This is supported at, for example, 
Figure 2. 

The Barrier et al. publication attached hereto as Appendix A contradicts the 
statement of the Barrier et al. publication cited in the Office Action. In particular, 
within approximately 13 months of the publication of the Barrier et al. publication 
(Oncogene, 24:6155-6164, 2005), Barrier et al. published that, "Microarray gene 
expression profiling is able to predict the prognosis of stage II colon cancer patients," 
(see, e.g., Abstract, Barrier et al., J. Clin. Oncol. 24(29):4685-91, 2006), 
demonstrating that undue experimentation was not necessary. Accordingly, 
Applicants submit that the skilled artisan, with teachings of the specification in hand, 
could determine gene expression levels in samples using gene expression analysis 
methods routine in the art and compare to validated normal controls without undue 
experimentation. 

For at least the foregoing reasons, Applicants respectfully submit that the 
claimed invention is enabled. Accordingly, Applicants respectfully request 
withdrawal of the §112, first paragraph rejection. 

VII. NON-STATUTORY OBVIOUSNESS-TYPE DOUBLE PATENTING 

Claims 49, 51, 56-58, 60-64, 79, 81-83 and 88 stand provisionally rejected on 
the ground of non-statutory double patenting over claims 3-6, 10 and 14 of copending 
Application No. 11/242,111. Applicants acknowledge the rejection and request that 
the rejection be held in abeyance until such time as allowable subject matter is 
identified in either application. 

Applicants respectfully request that if there should be any questions regarding 
the foregoing amendments or remarks that the Examiner call the undersigned. The 
Commissioner is hereby authorized to charge any fee deficiency or credit any 
overpayment of fees to Deposit Account No. 02-4800. 



Respectfully submitted 



Buchanan Ingersoll & Rooney LLP 



Date: January 9, 2007 




P.O. Box 1404 
Alexandria, VA 22313-1404 
858.509.7300 

Stage II Colon Cancer Prognosis Prediction by Tumor Gene 
Expression Profiling 

Alain Barrier, Pierre-Yves Boelle, Francois Roser, Jennifer Gregg, Chantal Tse, Didier Brault, Francois Lacaine, 
Sidney Houry, Michel Huguier, Brigitte Franc, Antoine Flahault, Antoinette Lemoine, and Sandrine Dudoit 

A B S T R A C T 

Purpose 

This study mainly aimed to identify and assess the performance of a microarray-based prognosis 
predictor (PP) for stage II colon cancer. A previously suggested 23-gene prognosis signature (PS) 
was also evaluated. 

Patients and Methods 

Tumor mRNA samples from 50 patients were profiled using oligonucleotide microarrays. PPs 
were built and assessed by random divisions of patients into training and validation sets (TSs and 
VSs, respectively). For each TS/VS split, a 30-gene PP, identified on the TS by selecting the 30 
most differentially expressed genes and applying diagonal linear discriminant analysis, was used 
to predict the prognoses of VS patients. Two schemes were considered: single-split validation, 
based on a single random split of patients into two groups of equal size (group 1 and group 2), and 
Monte Carlo cross validation (MCCV), whereby patients were repeatedly and randomly divided into 
TS and VS of various sizes. 

Results 

The 30-gene PP, identified from group 1 patients, yielded an 80% prognosis prediction accuracy 
on group 2 patients. MCCV yielded the following average prognosis prediction performance 
measures: 76.3% accuracy, 85.1% sensitivity, and 67.5% specificity. Improvements in prognosis 
prediction were observed with increasing TS size. The 30-gene PSs were found to be highly variable 
across TS/VS splits. Assessed on the same random splits of patients, the previously suggested 
23-gene PS yielded a 67.7% mean prognosis prediction accuracy. 

Conclusion 

Microarray gene expression profiling is able to predict the prognosis of stage II colon cancer 
patients. The present study also illustrates the usefulness of resampling techniques for honest 
performance assessment of microarray-based PPs. 

J Clin Oncol 24:4685-4691. © 2006 by American Society of Clinical Oncology 



Despite numerous clinical trials, the benefit of adjuvant chemotherapy for stage II
colon cancer patients has never been proved in a randomized study. In most
meta-analyses, there is a trend towards a benefit, but statistical significance is not
reached.[1] Including all stage II colon cancer patients in a randomized trial is
debatable. Even if a properly designed study, comprising thousands of patients,
demonstrated a significant benefit of adjuvant chemotherapy, it may not be logical
to conclude that this treatment should be administered to all patients. Indeed, such
a conclusion would not take into account that three fourths of patients are cured by
surgery alone and that the approach would lead to administering to all patients a
treatment that would be useful for only a few. Another, more rational, approach
would be to identify a subgroup of patients at high risk of recurrence, thus more
likely to benefit from adjuvant chemotherapy, and to include only these selected
patients in a randomized trial. This presupposes finding accurate prognosis
predictors (PPs) for stage II colon cancer patients.

As for several types of malignant tumors (breast carcinomas,[2,3] lung
carcinomas,[4,5] lymphomas[6,7]), microarray gene expression profiling has been
reported to accurately predict the prognosis of stage II colon cancer.[8] In their
report, Wang et al[8] identified, from a set of 38 patients, a 23-gene prognosis
signature (PS) that was validated on an independent set of 36 patients and yielded
a 78% prognosis prediction accuracy.

Fifty stage II colon cancer patients, with the same postoperative treatment (no
adjuvant chemotherapy) but with different outcomes (25 patients developed a
metachronous metastasis, whereas the other 25 remained disease free for at least
5 years), were included in the present study. Tumor samples were profiled using
the Affymetrix (Santa Clara, CA) HGU133A GeneChip, with the following aims:
(1) to identify a microarray-based PP and assess its performance in terms of
accuracy, sensitivity, and specificity, and (2) to assess the prognosis prediction
performance of the 23-gene PS proposed by Wang et al.[8]

Patients and Samples 

Fifty patients (27 male, 23 female; mean age, 71 years) operated on for a 
stage II colon adenocarcinoma between 1996 and 2000 were included in this 
study. The main patient and tumor characteristics are given in Table 1. None of 
the patients had emergency surgery or received any adjuvant chemotherapy. 
Twenty-five patients developed a distant metastasis (liver in 22 patients, lung 
in five patients) in the follow-up, and 21 within 3 years of surgery. The mean 
time to recurrence was 27 months (range, 14 to 52 months). The other 25 
patients remained disease free for at least 60 months, with mean follow-up of 
79 months (range, 60 to 101 months). 

Tumor samples were collected at time of surgery, with patients' informed 
consent, and were immediately stored in liquid nitrogen. Samples were reviewed 
by a pathologist to check the presence of at least 80% of tumor cells. 
None of the 50 tumors exhibited microsatellite instability. RNA samples 
were extracted from the tumors and hybridized to Affymetrix HGU133A 
GeneChips according to previously described protocol.[9] 

Gene expression measures were computed using the Robust Multichip 
Average method implemented in the Bioconductor R package rma 
(http://www.bioconductor.org) and described in Irizarry et al.[10] Gene expression 
measures are available at http://www.u707.jussieu.fr/boelle/genechips/index.html 
and http://www.stat.berkeley.edu/~sandrine. 

Data Analysis 

Prognosis prediction method. For a given split of patients into a training 
set (TS) and a validation set (VS), a 30-gene PP was built on the TS and its 
performance assessed on the VS as follows. 

Step 1. Gene expression measures were compared in recurrent and 
nonrecurrent TS patients by computing two-sample equal-variance t statistics 
for each of the 22,283 genes. A PS was defined in terms of the expression 
measures of the 30 genes with the largest absolute t statistics. 

Step 2. A PP was constructed by applying diagonal linear discriminant 
analysis (DLDA) to the 30-gene PS of the TS patients.[11,12] 

Step 3. The 30-gene PP from Step 2 was applied to predict the prognoses 
of the VS patients. 



Table 1. Patient and Tumor Characteristics 

Characteristic       Disease Free (n = 25)   Recurrence (n = 25)
Sex
  Female             13                      10
  Male               12                      15
Age, years
  Mean               71.5                    70.0
  Range              46-91                   41-84
Differentiation
  Well/moderate      20                      18
  Poor               5                       7
Location
  Right sided        9                       7
  Left sided         16                      18





Step 4. The predicted and actual prognoses (recurrence or no recurrence) 
of VS patients were compared to obtain the following three measures of 
prognosis prediction performance: accuracy (proportion of correctly predicted 
prognoses), sensitivity (proportion of correctly predicted recurrences), 
and specificity (proportion of correctly predicted nonrecurrences). 
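
Steps 1 to 4 can be illustrated with a minimal sketch in Python. This is not the authors' code (the study reports using R and Bioconductor); the function names, the toy data with four genes, the choice of k = 2, equal class priors in the discriminant rule, and the small variance floor are all assumptions made only for illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def t_statistic(a, b):
    """Two-sample equal-variance (pooled) t statistic for lists a and b."""
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    denom = sqrt(ss / (len(a) + len(b) - 2) * (1 / len(a) + 1 / len(b)))
    if denom == 0:  # degenerate case; guard added for this sketch only
        return 0.0 if ma == mb else float("inf") * (1 if ma > mb else -1)
    return (ma - mb) / denom


def select_genes(X, y, k):
    """Step 1: indices of the k genes with the largest absolute t statistics."""
    scores = []
    for g in range(len(X[0])):
        rec = [row[g] for row, label in zip(X, y) if label == 1]
        non = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_statistic(rec, non)))
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


def fit_dlda(X, y, genes):
    """Step 2: per-class gene means plus one pooled variance per gene."""
    means = {}
    for c in (0, 1):
        rows = [row for row, label in zip(X, y) if label == c]
        means[c] = [mean([row[g] for row in rows]) for g in genes]
    variances = []
    for j, g in enumerate(genes):
        ss = sum((row[g] - means[label][j]) ** 2 for row, label in zip(X, y))
        variances.append(max(ss / (len(X) - 2), 1e-12))  # floor avoids division by zero
    return {"genes": genes, "means": means, "variances": variances}


def predict_dlda(model, row):
    """Step 3: assign the class whose variance-scaled distance is smaller."""
    def distance(c):
        return sum((row[g] - m) ** 2 / v for g, m, v in
                   zip(model["genes"], model["means"][c], model["variances"]))
    return 1 if distance(1) < distance(0) else 0


def performance(predicted, actual):
    """Step 4: accuracy, sensitivity, specificity (1 = recurrence)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    return (tp + tn) / len(actual), tp / actual.count(1), tn / actual.count(0)


# Hypothetical toy data: rows are patients, columns are genes.
train_X = [[5.0, 1.0, 3.0, 2.0], [5.2, 1.1, 2.9, 2.2],
           [4.8, 0.9, 3.1, 1.9], [5.1, 1.2, 3.0, 2.1],
           [3.0, 2.0, 3.0, 2.0], [3.1, 2.1, 3.1, 2.1],
           [2.9, 1.9, 2.9, 2.2], [3.2, 2.2, 3.0, 1.9]]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
valid_X = [[5.0, 1.0, 3.0, 2.0], [3.0, 2.0, 3.1, 2.0],
           [4.9, 1.2, 2.9, 2.1], [3.1, 1.9, 3.0, 2.2]]
valid_y = [1, 0, 1, 0]

selected = select_genes(train_X, train_y, k=2)
model = fit_dlda(train_X, train_y, selected)
predicted = [predict_dlda(model, row) for row in valid_X]
accuracy, sensitivity, specificity = performance(predicted, valid_y)
print(selected, predicted, accuracy, sensitivity, specificity)
```

In the study itself the same sequence would run with k = 30 over 22,283 genes; the toy example only shows how the four steps connect.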

Validation procedure: Single-split validation. Two schemes were considered 
for dividing patients into TS and VS: single-split validation and Monte 
Carlo cross validation. 

Patients were randomly divided into two groups of equal size, group 1 
and group 2. Group 1 and group 2 were used as TS and VS, respectively. A 
30-gene PP was built on group 1 patients and its performance assessed on 
group 2 patients. 

Validation procedure: Monte Carlo cross validation. For Monte Carlo 
cross validation (MCCV), 16 different values for the TS size n_0 were considered: 
n_0 = 10, 12, ..., 40. For each choice of n_0, the 50 patients were repeatedly 
and randomly divided into 100 TS of size n_0 and corresponding VS of size 
50 - n_0. For each TS/VS split, a 30-gene PP was identified on TS patients and 
applied to VS patients as described herein. This yielded, for each value of the TS 
size n_0, 100 30-gene PSs and 100 measures of prognosis prediction performance. 
The gene compositions of the 100 PSs were compared. Graphical and 
numerical summaries (eg, minimum, maximum, and mean) of the distributions 
of prognosis prediction accuracies, sensitivities, and specificities for the 
16 × 100 = 1,600 TS/VS splits were obtained. 
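
The splitting scheme described above can be sketched as follows. The function name, the fixed seed, and the use of simple (unstratified) random sampling are assumptions for illustration; the text does not say whether splits were balanced by outcome.

```python
import random


def mccv_splits(n_patients=50, train_sizes=range(10, 41, 2),
                n_repeats=100, seed=0):
    """Yield (n0, training indices, validation indices) for every MCCV split.

    For each training-set size n0, the patients are randomly reordered
    n_repeats times; the first n0 form the TS and the rest form the VS.
    """
    rng = random.Random(seed)  # seed chosen arbitrarily for reproducibility
    patients = list(range(n_patients))
    for n0 in train_sizes:
        for _ in range(n_repeats):
            shuffled = rng.sample(patients, n_patients)
            yield n0, shuffled[:n0], shuffled[n0:]


splits = list(mccv_splits())
sizes = sorted({n0 for n0, _, _ in splits})
print(len(sizes), "training-set sizes,", len(splits), "splits in total")
```

Each yielded split would then be passed through the prediction method (build the PP on the TS, score it on the VS), and the resulting accuracies, sensitivities, and specificities summarized per value of n_0.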

Performance Assessment of the 23-Gene PS 

The prognosis prediction performance of the 23-gene PS of Wang et al[8] 
was assessed based on the same 16 × 100 random TS/VS splits of patients as for 
the 30-gene PS. For a given TS/VS split, a PP was obtained by applying DLDA 
to the 23-gene PS of the TS patients. This 23-gene PP was then applied to 
predict the prognoses of the VS patients. Predicted and actual prognoses 
(recurrence or no recurrence) of VS patients were compared. 

Proposal of a 30-Gene PS 

An overall 30-gene PS was identified based on all 50 patients, by comparing 
the expression measures of recurrent and nonrecurrent patients for each of 
the 22,283 genes using two-sample equal-variance t statistics and selecting the 
30 genes with the largest absolute t statistics. 



Single-Split Validation 

A 30-gene PS and corresponding PP were identified on the 25 
group 1 patients. Applied to the 25 group 2 patients, this 30-gene PP 
yielded 80% accuracy, 75% sensitivity, and 85% specificity. 

MCCV 

For each of the 16 values of the TS size n_0, the 100 random splits 
of patients into a TS and a VS each yielded a 30-gene PP and corresponding 
measures of prediction performance on the VS (accuracy, 
sensitivity, specificity). Numerical summaries of the distributions of 
prognosis prediction performance measures for the 16 × 100 TS/VS 
splits were 76.3% mean accuracy (range, 52.5% to 100.0%), 85.1% 
mean sensitivity, and 67.5% mean specificity. Prognosis prediction 
performance improved with TS size (Figs 1A and 1B). For TS 
of size 40, mean accuracy, sensitivity, and specificity were 82.7%, 
92.0%, and 73.4%, respectively. Sensitivity was higher than specificity 
for all TS sizes. 

The distribution of the number of selections for the set of 22,283 
genes is given in Table 2. The 1,600 30-gene PSs included a total of 
6,124 different genes; 3,080 of these 6,124 genes were selected only 
once, whereas 5,564 were selected fewer than 10 times; 55 genes were 
selected more than 100 times, and 14 more than 500 times. The most 
frequently selected gene was present in 1,176 PS (73.5%). 

Fig 1. Monte Carlo cross validation. Prognosis prediction performance of 
30-gene prognosis signatures. (A) Mean, minimum, and maximum prognosis 
prediction accuracies as a function of the training set (TS) size that were observed 
for the 100 random splits of patients; (B) mean accuracy, sensitivity, and 
specificity as a function of the TS size that were observed for the 100 random 
splits of patients. 



Table 2. Distribution of the Number of Selections (of 1,600 TS/VS splits)
for the 22,283 Genes

No. of Selections    No. of Genes
0                    16,159
1                    3,080
2                    1,048
3-5                  1,014
6-10                 422
11-20                251
21-50                181
51-100               73
101-200              31
201-500              10
501-1,000            7
> 1,000              7
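
The counts in Table 2 can be cross-checked against the figures quoted in the text with a few lines of arithmetic. This is only a consistency check on the published numbers; the dictionary keys are the table's bin labels. Note that the 5,564 genes quoted in the text correspond to the bins from 1 through 6-10 selections.

```python
# Gene counts per selection-frequency bin, copied from Table 2.
table2 = {
    "0": 16159, "1": 3080, "2": 1048, "3-5": 1014, "6-10": 422,
    "11-20": 251, "21-50": 181, "51-100": 73, "101-200": 31,
    "201-500": 10, "501-1,000": 7, "> 1,000": 7,
}

total_genes = sum(table2.values())
selected_at_least_once = total_genes - table2["0"]
bins_1_to_10 = sum(table2[k] for k in ("1", "2", "3-5", "6-10"))
more_than_100 = sum(table2[k] for k in ("101-200", "201-500", "501-1,000", "> 1,000"))
more_than_500 = table2["501-1,000"] + table2["> 1,000"]
top_gene_share = round(1176 / 1600, 3)  # most frequently selected gene

print(total_genes, selected_at_least_once, bins_1_to_10,
      more_than_100, more_than_500, top_gene_share)
```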



Proposal of a 30-Gene PS 

The 30 informative genes that were identified based on all n = 50 
patients are given in Table 3, with their t statistics, their permutation- 
based step-down maximum t statistics adjusted P values, 13 and their 
numbers of selections out of 1,600 TS/VS splits by MCCV (the numbers
of selections as a function of TS sizes are provided in Fig A1,
online only). The step-down maxT multiple testing procedure (MTP)
controls the family-wise error rate (ie, the chance of at least one
false-positive among the 22,283 tests). Unlike the classical Bonferroni 
procedure, 13 the step-down maxT MTP takes into account the joint 
distribution of the test statistics and, hence, is generally more powerful 
than such marginal procedures. Permutation-based step-down 
maxT-adjusted P values were computed using the Bioconductor R 
package multtest (function mt.maxT with B = 10,000 permutations). 
All 30 genes of the overall PS are among the 33 genes most frequently 
selected by MCCV. Seven genes have an adjusted P value of .0001 and 
were selected in more than 70% of the 1,600 PSs of MCCV. Five 
additional genes have an adjusted P value lower than .002 and were 
selected in 49% to 56% of the 1,600 PSs of MCCV. Of the 30 genes, 10 
are overexpressed in patients who experienced a recurrence, and 20 
are overexpressed in patients who remained disease free, including 10 
genes encoding ribosomal proteins.

For each value of n0, 100 30-gene PSs were identified and their
compositions compared. PSs tended to be less variable for larger TS
sizes. The total number of genes selected at least once decreased as the
TS size increased (Fig 2A). With TS of 10 patients, no single gene was
selected in more than 24 signatures; with TS of 40 patients, seven genes
were selected in all 100 signatures (Fig 2B).
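
The permutation-based step-down maxT adjustment used for the adjusted P values in Table 3 can be sketched as follows. This is a simplified illustration in Python, not the multtest implementation: the Welch t statistic, the synthetic data, and the small number of permutations are assumptions, and the function names are hypothetical. Each gene's observed |t| is compared with the permutation distribution of the maximum |t| over that gene and all less significant genes, which is what lets the procedure exploit the joint distribution of the test statistics.

```python
import random
from statistics import mean, variance


def abs_t_statistics(X, y):
    """Absolute Welch t statistic for every gene (column) of X."""
    stats = []
    for g in range(len(X[0])):
        a = [x[g] for x, label in zip(X, y) if label == 1]
        b = [x[g] for x, label in zip(X, y) if label == 0]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        stats.append(abs((mean(a) - mean(b)) / se))
    return stats


def step_down_maxt(X, y, n_perm=200, seed=0):
    """Step-down maxT adjusted p-values, returned in the original gene order."""
    rng = random.Random(seed)
    observed = abs_t_statistics(X, y)
    m = len(observed)
    # Genes ordered from most to least significant.
    order = sorted(range(m), key=lambda g: observed[g], reverse=True)
    exceed = [0] * m
    labels = list(y)
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute class labels, keeping class sizes
        permuted = abs_t_statistics(X, labels)
        running_max = 0.0
        # Successive maxima, built from the least significant gene upward.
        for rank in range(m - 1, -1, -1):
            running_max = max(running_max, permuted[order[rank]])
            if running_max >= observed[order[rank]]:
                exceed[rank] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for rank, g in enumerate(order):
        previous = max(previous, exceed[rank] / n_perm)  # enforce monotonicity
        adjusted[g] = previous
    return adjusted


# Synthetic data: 20 samples, 8 genes, only gene 0 differentially expressed.
rng = random.Random(2)
y = [1] * 10 + [0] * 10
X = [[rng.gauss(4.0 if (label == 1 and g == 0) else 0.0, 1.0) for g in range(8)]
     for label in y]
observed = abs_t_statistics(X, y)
adjusted = step_down_maxt(X, y)
print([round(p, 3) for p in adjusted])
```

With B permutations the smallest attainable nonzero adjusted P value is 1/B, which is consistent with the value .0001 reported for B = 10,000.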

Performance Assessment of the 23-Gene PS 

Assessed on the same 16 × 100 random TS/VS splits of patients,
the overall mean accuracy of the 23-gene PS 8 was 67.1%. The mean
accuracy increased with the TS size (Fig 3A). For each TS/VS split, 
accuracies of the 30- and 23-gene PSs were compared. For 1,190 
(74.4%) of the 1,600 splits, the 30-gene PS performed better than the 
23-gene PS (Fig 3B). 



The classical design of studies aiming to propose a prognosis pre- 
dictor based on gene expression profiling consists of identifying a 
prognosis signature and corresponding prognosis predictor from a 
TS and estimating the prediction accuracy of this PP on an inde- 
pendent VS. Such a single-split-validation design was applied in 
the first part of our study. Specifically, a 30-gene PP was built on a 
first group of 25 patients, using t statistic-based gene selection and
diagonal linear discriminant analysis. The good performance of this 
30-gene PP, when applied to a second group of 25 patients (80% 
accuracy, 75% sensitivity, 85% specificity), suggests the ability to suc- 
cessfully predict the outcome of stage II colon cancer patients. How- 
ever, the reproducibility of results for studies based on single-split 
validation is questionable. In particular, the variability in the
observed PP performance and PS composition (ie, the extent to which
they are affected by the choice of TS) is not taken into account.

Fig 2. Monte Carlo cross validation. 30-gene prognosis signature composition.
(A) The number of genes that were included in at least one of the 100 signatures
as a function of the training set (TS) size; (B) the number of genes that were
included in at least 10, 25, 50, 75, and 100 of the 100 signatures as a function of
the TS size.



The results from MCCV clearly suggest the possibility to use gene 
expression profiling to predict the prognosis of stage II colon cancer 
patients. For the 16 × 100 30-gene PPs, the mean prognosis prediction
accuracy was 76.3%; moreover, none of these 1,600 PPs yielded an 
accuracy lower than 50%. Mean sensitivity was higher than mean 
specificity (85.1% v 67.5%); this finding is of interest because the 
practical problem for stage II colon cancer patients, which underlies 
the present study, is the identification of the minority of these patients 
at high risk of metastatic recurrence, thus more likely to benefit from 
adjuvant chemotherapy. Performance consistently increased with TS
size to reach a maximum of 82.7% accuracy, 92.0% sensitivity, and 
73.4% specificity for TS of size 40. This suggests that, as expected, 
additional gains in performance could be obtained with predictors 
built on larger numbers of patients. 

MCCV also revealed great variability in PS composition and PP 
performance between random splits of patients. This variability,
which has been previously reported, 14-16 underlines the weakness of
studies based on a unique split of patients.

Fig 3. Monte Carlo cross validation. Prognosis prediction (PP) accuracy of the
23-gene prognosis signature. 8 (A) Mean accuracy of the 23-gene prognosis
signature (PS) 8 (blue line) and 30-gene prognosis signature (red line) as a function
of the training set (TS) size; (B) relative performance of the 23- and 30-gene PS
for each of the 100 random TS/validation set (VS) splits of patients as a function
of the TS size.

For a given TS and VS size, the range of observed accuracies was 
wide: 20% for the largest VS size, 40% for the smallest VS size. This 
suggests that the results of studies based on single-split validation 
should be interpreted with caution, because there is a risk of obtaining
overoptimistic performance estimates. In their report, Michiels et al
used multiple random splits of patients from seven previously published
studies, 2,4,5,7,17-19 and concluded that five of these studies did
not classify patients better than did chance. 14

PS composition was highly variable, especially for TS of small 
sizes; with TS of size 10, more than 2,200 different genes were included 
in the 100 30-gene signatures, meaning that the vast majority of these 
genes were selected only once. Variability of PS composition was 
also observed for larger TS, but it concerned only a subset of genes; 
with TS of size 40, 280 different genes were included in the 100 
30-gene signatures, but 12 of these genes were constantly, or almost 
constantly, selected.

Table 3. Composition of the 30-Gene Prognosis Signature Identified From the 50 Patients

Affymetrix Probe ID | GenBank Accession No. | Gene Name | t Statistic | Adjusted P* | Selection by MCCV: No. | % | Rank

Overexpressed genes in patients who remained disease free
221943_x_at | AW303136 | ribosomal protein L38 | -7.645 | .0001 | 1176 | 73.5 | 1
213642_at | BE312027 | ribosomal protein L27 | -7.528 | .0001 | 1169 | 73.1 | 2
213350_at | BF680255 | ribosomal protein S11 | -7.346 | .0001 | 1134 | 70.9 | 3
202028_s_at | BC000603 | ribosomal protein L38 | -7.342 | .0001 | 1116 | 69.8 | 4
212044_s_at | BE737027 | ribosomal protein L27a | -7.311 | .0001 | 1103 | 68.9 | 6
212952_at | AA910371 | calreticulin | -7.302 | .0001 | 1115 | 69.7 | 5
216246_at | AF113008 | ribosomal protein S20 | -7.153 | .0001 | 1101 | 68.8 | 7
218157_x_at | NM_020239 | CDC42 small effector 1 | -8.443 | .0006 | 890 | 55.2 | 9
213826_s_at | AA292281 | H3 histone, family 3A | -6.266 | .0012 | 833 | 52.1 | 10
200630_x_at | AV702810 | SET translocation (myeloid leukemia-associated) | -6.147 | .0019 | 785 | 49.1 | 12
210231_x_at | D45198 | SET translocation (myeloid leukemia-associated) | -5.800 | .0047 | 633 | 39.6 | 13
216609_at | AF065241 | thioredoxin | -5.771 | .0050 | 623 | 38.9 | 14
202648_at | BC000023 | ribosomal protein S19 | -5.622 | .0082 | 492 | 30.8 | 15
212953_x_at | BE251303 | calreticulin | -5.471 | .0123 | 430 | 26.9 | 17
214001_x_at | AW302047 | ribosomal protein S10 | -5.438 | .0134 | 364 | 22.8 | 19
2 1 404 i_x_at | Dbo57772 | ribosomal protein L37a | -5.426 | .0139 | 378 | 23.6 | 18
/i oo/y_at | AV/2ob4b | SMT3 suppressor of mif two 3 homolog 2 (yeast) | -5.348 | .0169 | 355 | 22.2 | 20
200908_s_at | BC005354 | ribosomal protein, large P2 | -5.222 | .0248 | 223 | 13.9 | 24
209327_s_at | BC000587 | mannan-binding lectin serine protease 1 (C4/C2 activating component of Ra-reactive factor) | -5.042 | .0427 | 168 | 10.5 | 31
205302_at | NM_000596 | insulin-like growth factor binding protein 1 | | 0R3K | 166 | 10.4 | 33

Overexpressed genes in patients with a recurrence
205550_s_at | NM_004899 | brain and reproductive organ-expressed (TNFRSF1A modulator) | 6.595 | .0003 | 911 | 56.9 | 8
213893_x_at | AA161026 | postmeiotic segregation increased 2-like 2 | 6.219 | .0014 | 807 | 50.4 | 11
210243_s_at | AF038661 | UDP-Gal | 5.519 | .0108 | **/ / | zy.o | I o
212608_s_at | W85912 | | 5.366 | .0164 | 348 | 21.8 | 21
36554_at | Y15521 | acetylserotonin O-methyltransferase-like | 5.189 | .0270 | 336 | 21.0 | 22
219481_at | NM_024525 | tetratricopeptide repeat domain 13 | 4.959 | .0543 | 186 | 11.6 | 27
209221_s_at | AI753638 | oxysterol binding protein-like 2 | 4.947 | .0559 | 272 | 17.0 | 23
212500_at | AL049319 | chromosome 10 open reading frame 22 | 4.942 | .0569 | 172 | 10.8 | 29
219038_at | NM_024657 | zinc finger, CW-type with coiled-coil domain 2 | 4.933 | .0582 | 192 | 12.0 | 25
212435_at | AA205593 | | 4.932 | .0586 | 167 | 10.4 | 32

Abbreviations: MCCV, Monte Carlo cross validation; FWER, family-wise error rate; maxT, maximum t statistic.

*FWER-controlling permutation-based step-down maxT multiple testing procedure, implemented in the Bioconductor R package multtest. 13







The findings from MCCV strongly suggest that a unique PP does 
not exist and that many PSs lead to PPs with similar performances. 
This conclusion is consistent with the well-known fact that, especially 
for high-dimensional prediction problems, many models yield the
same fit. 

In the present study, the following two main choices were made 
for building prognosis predictors: (1) the number of genes to include
in the prognosis signature was set to 30, on the basis of previous
results 9 ; and (2) prognosis predictors were constructed using DLDA,
since DLDA was shown to be competitive with more complex
techniques. 11,12 Both of these choices were somewhat arbitrary, and many
other gene selection methods and classifiers could have been used. To
determine the influence of our choices on results, we have reproduced
exactly the same MCCV analysis as above with 30-gene t statistic-based
prognosis signatures and nearest neighbor classifiers 11 (Fig A2,
online only), and with DLDA based on prognosis signatures including 
various numbers of genes (from 10 to 200; Fig A3, online only). The 
results of these supplementary analyses suggest a moderate influence 
of the length of the PS and the choice of classifier on PP performance. 



The second aim of the present study was to assess the performance
of the PS proposed by Wang et al. 8 These authors built from
a TS of 36 patients a PP based on the expression measures of 23 
genes, and applied this PP to a VS of 38 patients, with a 78% 
prognosis prediction accuracy. Interestingly, this 23-gene PS led to 
fairly accurate predictors for the prognosis of our patients (overall 
mean accuracy of 67.1% and a mean accuracy > 70% for TS of 
> 30). To our knowledge, this is the first time that a PS proposed by 
one research team is successfully validated by another research 
team. Since we used the same 1,600 random splits of patients, we 
were able to directly compare the performance of the 23-gene PP 
and the 30-gene PP. The mean prognosis prediction accuracy was 
76.3% for the 30-gene PP, and 67.1% for the 23-gene PP; for 1,190 
(74.4%) of the 1,600 splits, the accuracy of the 30-gene PP was 
higher than that of the 23-gene PP. We hypothesized that the 
observed differences in accuracy between 30-gene and 23-gene PP 
were mainly caused by the different criteria used to classify patients 
into the disease-free group (disease status after 5 years in our study 
v 3 years for Wang et al 8 ). This hypothesis was confirmed by results
of an additional study in which we considered the 3-year status of
our patients (Fig A4, online only).

MCCV allowed honest performance assessment of a prognosis 
prediction procedure, but did not lead to the identification of a unique 
prognosis signature and corresponding prognosis predictor. Instead, 
MCCV suggested that many combinations of genes could lead to PP 
with similar performances. Despite these findings, it seemed of inter- 
est to propose a single prognosis predictor that could be used by 
others. Since we applied to the whole set of 50 patients the same gene
selection method as in MCCV, performance estimates of the proposed
30-gene PP are provided by results of MCCV. From a statistical
point of view, all 30 genes do not have the same value. Two groups 
might be distinguished: a "stable" group of 12 genes, and a "variable" 
group of 18 genes. Seven genes had a permutation-based step-down 
maxT-adjusted P value of .0001; they were selected on average 70% of
the times by MCCV, and constantly with large TS. Five additional 
genes had an adjusted P value lower than .002; they were selected on 
average 50% of the times by MCCV, and almost constantly for large 
TS. It would be of interest to assess the performance of a "reduced" 
prognosis predictor containing these 12 "stable" genes. From a bio- 
logic point of view, the presence of 10 genes encoding ribosomal 
proteins in our proposed 30-gene prognosis signature is of particular 
interest. All 10 genes were overexpressed in patients who remained 
disease free. More remarkably, five of these 10 genes were among the
seven genes with the lowest adjusted P values (.0001) and were the five
genes selected most often by MCCV. The best-known function shared 
by ribosomal proteins is their role in the assembly of ribosomal sub- 
units, and, as a result, their role in translation. Individual ribosomal 
proteins have been implicated in a wide variety of biologic functions,
including cell cycle progression, apoptosis, and DNA damage
responses. 20-23 It has also been suggested that their role in these
processes may arise independently of their role in the ribosome 
itself. Our data raise the possibility that some ribosomal protein 
genes could play a role in tumor invasion, the latter being favored 
by their decreased transcription. 

In conclusion, the present study suggests the possibility of using 
functional genomic approaches to predict the prognosis of stage II 
colon cancer patients, thereby identifying a subgroup of patients at 
high risk of metastatic recurrence and thus more likely to benefit from 
adjuvant chemotherapy. At this point, it seems premature to claim 
that the decision to give patients a postoperative treatment should be 
based on their gene expression profiles. More rationally, we propose 
the use of gene expression profiling to select stage II patients to include 
in future studies aiming to assess the potential benefits of adjuvant 
chemotherapy. The present study also suggests the usefulness of
resampling techniques for honest performance assessment of
microarray-based prognosis predictors.

Diagonal linear discriminant analysis: A mathematical form of
classifier that combines the component features by a
weighted linear average. With gene expression based classifiers,
the components are generally the logarithm of expression level of
the selected genes. The weights are based on the degree of differential
expression of the individual genes among the classes.

Monte Carlo cross validation: A method used for assessing
variability in the performance of a classifier, by repeating
split sample validation with random allocation to training and
validation sets.

Training Set: Samples used in a developmental study to define a clas- 
sifier. The classifier can be internally validated in the test set of samples; 
those that were not used to develop the classifier. 

Validation Set: Samples used in evaluating performance of a classi- 
fier. The validation set is formed by the units not used in developing the 
classifier (ie, the training set and test set).

Abstract

Background: In this paper we present a method for the statistical assessment of cancer predictors 
which make use of gene expression profiles. The methodology is applied to a new data set of 
microarray gene expression data collected in Casa Sollievo della Sofferenza Hospital, Foggia - Italy. 
The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients 
affected by colon cancer. We propose to give answers to some questions which are relevant for 
the automatic diagnosis of cancer such as: Is the size of the available data set sufficient to build 
accurate classifiers? What is the statistical significance of the associated error rates? In what ways 
can accuracy be considered dependent on the adopted classification scheme? How many genes are
correlated with the pathology and how many are sufficient for an accurate colon cancer 
classification? The method we propose answers these questions whilst avoiding the potential pitfalls 
hidden in the analysis and interpretation of microarray data. 

Results: We estimate the generalization error, evaluated through the Leave-K-Out Cross 
Validation error, for three different classification schemes by varying the number of training 
examples and the number of the genes used. The statistical significance of the error rate is 
measured by using a permutation test. We provide a statistical analysis in terms of the frequencies 
of the genes involved in the classification. Using the whole set of genes, we found that the Weighted 
Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens 
with 25 training examples, providing e = 21% (p = 0.045) as an error rate. This remains constant 
even when the number of examples increases. Moreover, Regularized Least Squares (RLS) and 
Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with an error 
rate of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. Moreover, the error rate
decreases as the training set size increases, reaching its best performances with 35 training
examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p =
0.019). Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with
the pathology, resulting from the signal-to-noise statistic. Moreover the performances of RLS and
SVM classifiers do not change when 74% of genes are used. They degrade progressively, reaching e = 16%
(p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined
by our statistical analysis and the major roles they play in colorectal tumorigenesis are discussed.

Conclusions: The method proposed provides statistically significant answers to precise questions
relevant for the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it
is possible to train statistically significant classifiers for colon cancer diagnosis. As for the definition
of the number of genes sufficient for a reliable classification of colon cancer, our results suggest
that it depends on the accuracy required.



Background 

Gene expression from DNA microarray data offers biolo- 
gists and pathologists the possibility to deal with the 
problem of cancer diagnosis and prognosis from a quan- 
titative point of view [1]. Conventional tumor diagnosis 
consists of the examination of the morphological appear- 
ance of tissue specimens by trained pathologists. It is sub- 
jective and generally it does not allow the establishing of 
a unique therapy as tumors with similar histopathological 
appearances can follow different clinical courses [2]. Gene 
expression data provide a snapshot of the molecular status 
of a sample of cells in a given tissue, returning the expres- 
sion levels of thousands of genes simultaneously. They 
make it possible to analyze the genes involved in a partic- 
ular type of cancer [3] as well as the classification of tumor 
specimens in different categories [4,5]. Although DNA 
microarray data offer enormous opportunities for the def- 
inition and understanding of several pathologies, they 
hide potential pitfalls in their analysis and interpretation 
[6,7]. A large number of overoptimistic results have been
recently published in the literature regarding the possibil- 
ity of constructing very accurate prediction rules for cancer 
from only a few genes. Zhang et al. [8] showed that a three 
gene classification tree had an error rate of 2% in colon 
cancer diagnosis, and Guyon et al. [9] showed that a Sup- 
port Vector Machine (SVM) trained on only two genes had 
a zero Leave-One-Out (LOO) error in classifying patients 
with leukemia. 

There exists a twofold explanation for such misleading
results. The first one concerns the data. Normally, a typical
experiment of cancer classification by gene expression
data consists of a small number ℓ of specimens, between 10
and 100 examples, each of which is composed of a
large number d (in the order of tens of thousands) of gene
expression levels. We know [10] that the VC-dimension of
the class of linear indicator functions in R^d is d + 1. This
means that the simplest classifier, consisting of a separating
hyperplane living in the space of the input specimens,
is able to separate d + 1 points in general position independently
of their labelling. In the application at hand, where the number of
features (gene expression levels) d is some orders of magnitude
greater than ℓ, the possibility of separating the specimens
perfectly, without errors, is implied. This
problem, known in the machine learning literature as "overfitting",
is exactly the kind of problem that should be
avoided in order to construct predictors able to generalize,
i.e. which are able to correctly predict the labels of new
specimens.
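
The consequence of d being much larger than ℓ can be illustrated with a small experiment. The sketch below is an illustration under stated assumptions, not part of the method: it draws a handful of random "specimens" with pure-noise features, assigns them random labels, and trains a plain perceptron (the function names and sizes are hypothetical). Because so few points in so many dimensions are linearly separable under any labelling, the training error reaches zero even though features and labels are unrelated.

```python
import random


def train_perceptron(points, labels, max_epochs=1000):
    """Classical perceptron; stops once every training point is classified correctly."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def count_training_errors(points, labels, w, b):
    """Number of training points on the wrong side of the hyperplane."""
    return sum(1 for x, y in zip(points, labels)
               if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0)


rng = random.Random(0)
n_specimens, n_genes = 8, 200  # far fewer specimens than features
points = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_specimens)]

errors_per_labelling = []
for _ in range(20):
    labels = [rng.choice((-1, 1)) for _ in range(n_specimens)]  # labels unrelated to data
    w, b = train_perceptron(points, labels)
    errors_per_labelling.append(count_training_errors(points, labels, w, b))

print(errors_per_labelling)
```

A zero training error here says nothing about new specimens, which is why the error must be estimated on data not used for training.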

The second reason concerns the methods of analysis. This 
can be better illustrated through some examples. It has 
just been said that the ultimate goal of a learning machine 
is that of generalizing. How is the generalization error of 
a predictor measured? What is the statistical significance 
of such a quantity given that it is measured by using only 
a few examples? Different methodologies will return very 
different answers. It is well known that the LOO-error provides
an almost unbiased estimate of the generalization
error of a predictor [11]. Although the bias of the said esti- 
mator is low, it is highly variable [6] and has little statisti- 
cal significance [12]. On the contrary, the Leave-K-Out 
Cross Validation (LKOCV) error provides a more signifi- 
cant estimate of the generalization error and it should be 
used to assess the accuracy of a classifier [12]. One further 
example concerns the methods that select a subset of 
genes to work with to reduce the problem of overfitting 
and for finding informative genetic markers of a particular 
pathology [8,9]. As Ambroise and McLachlan in [6] have 
admirably pointed out, such methods should carefully 
avoid the selection bias problem if reliable estimations of 
the generalization error of predictors are to be obtained. 
In the present paper a general methodology for the statis- 
tical assessment of prediction rules trained by using gene 
expression data is described, which can be seen as a natu- 
ral extension of [13] and [12]. The method answers pre- 
cise questions relevant to cancer diagnosis, avoiding the 
potential pitfalls connected to microarray data. In this 
study a new data set of gene expression data is used which 
was collected from 25 patients affected by colon cancer in 
"Casa Sollievo della Sofferenza" (CSS) Hospital, San Gio- 
vanni Rotondo (FG), Italy. The first set of questions posed 
concerns the data set. Is the size of the available data set 
sufficient to build accurate predictors? In which ways does 
accuracy depend on the prediction model? What is the sta- 
tistical significance of the prediction error measured? The 
second set of questions is about the number of gene 
expression levels. How many genes are correlated with the
pathology? How do the accuracy and the statistical significance
of the predictor change with respect to the number
of the genes used? How does the adopted feature selection
strategy influence the prediction error with respect to a
random selection of genes? Answers to these questions
were provided by using well established models for the
classification of gene expression data. In particular we
resorted to Weighted Voting Algorithm (WVA) classifiers
[1,14], Regularized Least Squares (RLS) classifiers [15,16]
and Support Vector Machine (SVM) classifiers [10]. For 
the assessment of the statistical significance of the classifi- 
cation errors measured, non parametric permutation tests 
[17,18] were adopted. 

Results 

Data set description 

Study population 

Twenty-five patients (14 males; mean age: 60 ± 14 years), 
who underwent colonic resection for colorectal cancer 
(CRC) at CSS hospital, were prospectively recruited into
this study. Two samples from each patient were available, 
one from colon cancer tissue and one from normal 
colonic mucosa tissue. The samples had been obtained 
during the surgery, immediately frozen in liquid nitrogen 
and then stored at -80 °C. All of them were reviewed by 
the same experienced pathologist to confirm the histolog- 
ical diagnosis. None of the patients suffered from heredi- 
tary CRC or had received preoperative chemo- 
radiotherapy. Informed consent to take part in this study 
was obtained from all the patients. The study was 
approved by the Hospital's Ethics Committee. 

RNA extraction from fresh frozen tissue 

Total RNA from 150-200 mg of fresh frozen tissue was 
isolated by phenol-chloroform extraction (TRIzol Rea- 
gent, Invitrogen, Carlsbad, CA) and subsequently purified 
through column chromatography (RNeasy Mini Kit, Qia- 
gen, Valencia, CA) according to the manufacturer's 
instructions. RNA integrity was monitored using denaturing
agarose gel electrophoresis in 1X MOPS. Three neoplastic
samples were discarded from the final analysis
since their RNA preparation was suboptimal. 

Microarray assays 

Biotinylated target cRNA was generated from 12 mg as 
described by the Affymetrix Expression Analysis Gene- 
Chip Technical Manual (Affymetrix, Santa Clara, Califor- 
nia). Briefly, double-stranded cDNA was synthesized from 
total RNA using the Superscript Choice System (Invitro- 
gen, Carlsbad, California), a primer containing poly(dT) 
and a T7 RNA polymerase promoter sequence. In vitro 
transcription using double-stranded cDNA as a template 
in the presence of biotinylated UTP and CTP was carried 
out using BioArray High Yield RNA Transcript Labeling 
Kit (Enzo Diagnostics, Farmingdale, New York). The
resulting biotinylated cRNA "target" was purified and
quantified. Fifteen micrograms of biotinylated cRNA were
randomly fragmented to an average size of 50 nucleotides
by incubating in 40 mM TRIS-acetate, pH 8.1, 100 mM
potassium acetate, and 30 mM magnesium acetate at
94 °C for 35 minutes. The fragmented cRNA was hybridized
for 16 hours at 45 °C on Human Genome U133A
GeneChips containing a total of 22,283 probe sets and
then stained in a Fluidics station with streptavidin/phycoerythrin,
followed by staining with a streptavidin antibody
and streptavidin/phycoerythrin. Arrays were
scanned on a Genearray scanner G2500A by using stand- 
ard Affymetrix protocols. Absolute data analysis was per- 
formed using the Affymetrix Microarray Suite 5.0 
software. 

Algorithms 

Estimating the number of training examples 
We are given a data set S = {(x,, y,), (x 2 , y 2 ), (x€, y£)} 
composed of € labelled specimens, where x, e R d and y, e 
{-1, 1 } for i = 1, 2,..., €. Let us suppose we have € + positive 
and €. negative examples, such that € - € + + In order to 
estimate the minimum number of examples to be used for 
the training of a classifier with a low error rate and a high 
statistical significance we used a two-step method: a cross 
validation procedure for the estimation of the error rate of 
classifiers trained through a given number of examples, 
and a permutation test for the assessment of the statistical 
significance of the classification accuracy obtained. In par- 
ticular, let n be the training set size, with n = 1, 2,...,€ - 1. 
For every value of n, s, pairs (D n , T k ) of training and test 
sets are built by random sampling without replacement 
into the data set S, with n and k as their respective exam- 
ples, where € = n + k. In the training/test split of the data, 
the same proportion of positive and negative examples as 
5 is preserved. For every random split, a classifier is trained 
by using the examples in D n and its error rate e n . is evalu- 
ated by testing it on T k . The selection of the parameter on 
which the classifier depends (C for SVM and X for RLS 
classifiers) is carried out by using the examples in D n only. 
In particular, the C parameter in SVM is selected minimiz- 
ing the three-fold cross validation error [19] and the X 
parameter in RLS is selected minimizing the LOO-error. 
Note that in the case of RLS, the evaluation of the LOO- 
error requires just one training [16]. This procedure for 
selecting the parameter ensures that e n . is unbiased as it 

does not involve the test set T k . So, for each value of n, the 
average error rate e n =-™-Xili e n I is evaluated. Notice 
that when n = $\ell$ - 1, the classical procedure for the
measurement of the LOO-error, which involves $s_1 = \ell$ training/test
pairs ($D_{\ell-1}$, $T_1$), is used. The second step consists of
evaluating, for every n, the statistical significance of the
error rate $e_n$. In a nutshell, we are interested in measuring
to what extent the accuracy observed is due to the existing
correlation between gene expression levels $x_i$ and class
labels $y_i$, and to what extent it is observed by chance because
of the high dimensionality of the space where the examples live.
In order to assess the statistical significance of the error
rate, the classical method of hypothesis testing is applied.
Let $H_0$ be the null hypothesis in which it is assumed that
the random variables x and y are independent. To evaluate
the p-value corresponding to $e_n$, it is necessary to know the
probability density function of $e_n$ under the null hypothesis.
Since this is unknown, a nonparametric permutation
test [17] is needed, the latter being a method of estimating
the empirical probability density function of any statistic
under $H_0$ from the available data. In the context of
classification, the method consists of a) permuting randomly
the labels of the training set; b) training a random classifier
on this randomly labelled training set and c) testing
the classifier obtained on a test set having correctly
labelled examples. The reason for this lies in the circumstance
that under the null hypothesis all the training sets
generated through label permutations are equally likely to
be observed, given that the random variables x and y are
independent. The permutation test technique then allows us
to determine the percentage of classifiers trained on randomly
labelled data having an error rate less than $e_n$ in
classifying correctly labelled data. In particular the following
steps are carried out. For every random split of S in
training and test sets ($D_n$, $T_k$), we perform $s_2$ random
permutations of the labels of examples belonging to the
training set $D_n$. Let $D_n^p$ be the training set with randomly
permuted labels. For every permutation, a classifier is
trained by using $D_n^p$ and the classifier itself is tested on
the test set $T_k$, which has correctly labelled examples. Even
in such a case, the parameter on which the classifier
depends is selected by using only the examples in $D_n^p$. Let
us indicate with $e_n^{i,j}$ the error rate of the random classifier
trained on n examples in the i-th cross validation and in
the j-th random permutation. Then the empirical probability
density function of the error rate under the null
hypothesis is:



$$p_n(x) = \frac{1}{s_1 s_2} \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} \delta\left(x - e_n^{i,j}\right) \qquad (1)$$



composed of a sum of delta functions centered on the 
errors measured. The statistical significance (p-value) of 
the error rate e n is given by the percentage of error rates 
smaller than e n . 
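
The permutation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nearest-centroid rule stands in for the WVA, RLS and SVM classifiers of the paper, no classifier parameter is tuned, the data are synthetic, and the names `nearest_centroid_error`, `stratified_split` and `permutation_test` are hypothetical.

```python
import numpy as np

def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Error rate of a nearest-centroid rule (a stand-in classifier, not the paper's)."""
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)
    d_pos = np.linalg.norm(X_test - mu_pos, axis=1)
    d_neg = np.linalg.norm(X_test - mu_neg, axis=1)
    y_pred = np.where(d_pos < d_neg, 1, -1)
    return float(np.mean(y_pred != y_test))

def stratified_split(y, n, rng):
    """Draw about n training indices keeping the class proportions of y.
    The remaining examples form the test set."""
    train = []
    for label in (1, -1):
        members = np.flatnonzero(y == label)
        k = int(round(n * len(members) / len(y)))
        train.extend(rng.choice(members, size=k, replace=False))
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

def permutation_test(X, y, n, s1=20, s2=20, seed=0):
    """Return the average error e_n, its p-value and the s1*s2 null errors."""
    rng = np.random.default_rng(seed)
    observed, null = [], []
    for _ in range(s1):                      # s1 random splits (D_n, T_k)
        tr, te = stratified_split(y, n, rng)
        observed.append(nearest_centroid_error(X[tr], y[tr], X[te], y[te]))
        for _ in range(s2):                  # s2 label permutations of D_n
            y_perm = rng.permutation(y[tr])  # test labels stay correct
            null.append(nearest_centroid_error(X[tr], y_perm, X[te], y[te]))
    null = np.array(null)
    e_n = float(np.mean(observed))
    p_value = float(np.mean(null < e_n))     # fraction of null errors below e_n
    return e_n, p_value, null

# Synthetic data: 40 examples, 50 "genes", the first 10 carry the class signal.
rng = np.random.default_rng(1)
y = np.array([1] * 20 + [-1] * 20)
X = rng.normal(size=(40, 50))
X[y == 1, :10] += 2.0
e_n, p_value, null_errors = permutation_test(X, y, n=20)
print(f"e_n = {e_n:.3f}, p-value = {p_value:.3f}")
```

The array of null errors plays the role of the sum of delta functions in (1); the p-value is simply the fraction of its entries that fall below the observed average error.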

Estimating the number of genes 

The procedure described in the previous section makes it 
possible to determine the number n of training examples 
to use for building, in principle, an accurate and statisti- 
cally significant classifier. This section is focused instead 
on the following problems. How many genes are needed 
to classify a new specimen? What is the statistical significance
of the error rate of a classifier trained by using n
examples, each of which is composed of a subset of g genes?
In order to answer these questions a methodology is used
similar to the one described in the previous section, with
the main difference being that this time the specimens are
composed of subsets of g genes. In particular, for every g =
1, 2, ..., d, where d is the total number of genes available, $s_1$
pairs ($D_n$, $T_{\ell-n}$) of training and test sets are built by random
sampling without replacement from the data set S,
with n and $\ell$ - n examples respectively. Also in this case,
the same proportion of positive and negative examples as 
in S is preserved. It should be noted that here the number 
of training and test examples is constant. The training set 
is employed to rank the genes according to the value of the 
statistic [1]: 



$$T_{S2N}(j) = \frac{\mu_+(j) - \mu_-(j)}{\sigma_+(j) + \sigma_-(j)}, \qquad j = 1, 2, \ldots, d \qquad (2)$$



where j is the gene index, and ($\mu_+(j)$, $\sigma_+(j)$) and ($\mu_-(j)$, $\sigma_-(j)$)
are the mean and the standard deviation of the expression
levels of the j-th gene in the positive and negative examples
respectively, belonging to the current training set. By
using the gene list thus sorted, reduced training and test
sets ($\tilde{D}_n$, $\tilde{T}_{\ell-n}$) containing the same examples as the current
training and test sets are built, each of which is composed
of the g genes that are most correlated with the class
labels. In particular, each example in the reduced training
and test sets contains the expression levels of the first g/2
and of the last g/2 genes in the list. Such a gene selection
strategy provides better results than those provided by
ranking the genes according to the absolute value of (2), as
also reported in [1,14]. For every random split, a classifier
is trained by using those examples in $\tilde{D}_n$ having g components,
and its error rate $e_g^i$ is evaluated by testing it on
Table 1: Error rate e and p-value p for different training set sizes.

n  | WVA e | WVA p | RLS e | RLS p | SVM e | SVM p
10 | 25%   | 0.078 | 21%   | 0.048 | 21%   | 0.053
15 | 24%   | 0.056 | 19%   | 0.035 | 18%   | 0.037
20 | 23%   | 0.066 | 16%   | 0.028 | 15%   | 0.026
25 | 21%   | 0.045 | 16%   | 0.028 | 14%   | 0.022
30 | 21%   | 0.050 | 15%   | 0.027 | 13%   | 0.017
35 | 19%   | 0.069 | 14%   | 0.027 | 11%   | 0.019
40 | 21%   | 0.102 | 15%   | 0.109 | 12%   | 0.022
46 | 21%   | 0.493 | 14%   | 0.489 | 11%   | 0.495



$\tilde{T}_{\ell-n}$, which also has examples with g components. Then, for
every value of g, we evaluate the average error rate
$e_g = \frac{1}{s_1}\sum_{i=1}^{s_1} e_g^i$. Two observations should be made. The

first is that the procedure of gene ranking involves the 
examples in the training set only. That is to say, for each 
iteration the set of g genes is determined on the basis of 
the training examples only. The test set is thus out of the 
selection process. This makes the estimated error rate 
selection bias free [6]. The second is that, in general, after 
each cross validation the list of the g selected genes 
changes. 
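
As an illustration of the gene ranking step, the sketch below computes the statistic (2) on a training set and keeps the first g/2 and the last g/2 genes of the sorted list. It is a minimal example, assuming class labels coded as +1/-1, an even g, and the sample standard deviation (`ddof=1`), none of which is specified in the text; the function names are hypothetical.

```python
import numpy as np

def s2n_statistic(X_train, y_train):
    """Signal-to-noise statistic (2) for every gene, from training examples only."""
    pos = X_train[y_train == 1]
    neg = X_train[y_train == -1]
    # ddof=1 (sample standard deviation) is an assumption of this sketch.
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    )

def select_genes(X_train, y_train, g):
    """Indices of the first g/2 and last g/2 genes in the list sorted by (2)."""
    t = s2n_statistic(X_train, y_train)
    order = np.argsort(-t)          # descending: most positive scores first
    half = g // 2
    return np.concatenate([order[:half], order[-half:]])

# Synthetic training set: gene 0 is higher in the positive class,
# gene 1 is higher in the negative class, the others are noise.
rng = np.random.default_rng(0)
y_train = np.array([1] * 10 + [-1] * 10)
X_train = rng.normal(size=(20, 20))
X_train[y_train == 1, 0] += 10.0
X_train[y_train == -1, 1] += 10.0

selected = select_genes(X_train, y_train, g=2)
print(sorted(selected.tolist()))
# The same column indices are then applied to the test set:
# X_test_reduced = X_test[:, selected]
```

Because `select_genes` only ever sees the training examples, the test set stays out of the selection process, which is the point made in the first observation above.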

The second step of the procedure consists in evaluating,
for every g, the statistical significance of the error rate $e_g$.
For this purpose, for every random split of S, $s_2$ random
permutations of the labels of examples in the reduced
training set $\tilde{D}_n$ are performed. Let $\tilde{D}_n^p$ be the training set
with randomly permuted labels. For every permutation, a
random classifier is trained by using $\tilde{D}_n^p$ and the classifier
is tested on the reduced test set $\tilde{T}_{\ell-n}$ having correctly
labelled examples. Let $e_g^{i,j}$ be the error rate of the random
classifier trained on $\tilde{D}_n^p$ in the i-th cross validation and in
the j-th random permutation. Then the empirical probability
density function of the error rate under the null
hypothesis is:

$$p_g(x) = \frac{1}{s_1 s_2} \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} \delta\left(x - e_g^{i,j}\right)$$

composed of a sum of delta functions centered on the
errors measured. The statistical significance (p-value) of
the error rate $e_g$ is given by the percentage of error rates
smaller than $e_g$.



Frequency assessment of the genes selected 
It has been stated that the list of g genes selected in each
cross validation changes because the selection of n examples
from the data set S is random. Nevertheless, since the
statistic (2) assigns high scores in absolute value to the
genes most correlated with the class labels, the most
informative genes are expected to appear in the first/last
positions of the list, irrespective of the n examples used for
evaluating the $T_{S2N}$ statistic. Therefore the frequency $f_j$ of
appearance of gene j in the lists of the genes selected during
the cross validation procedure can be used as a measure
of the importance of gene j in the problem at hand. $f_j$
is given by the ratio between the number of appearances
of the gene j in the top g positions and the number $s_1$ of
cross validations. To assess the statistical significance of $f_j$,
it is necessary to resort to the permutation test. In particular,
$s_1$ random drawings of n examples from S are performed
and for each one of them $s_2$ random permutations
of the labels of the n examples are carried out. For each
random permutation of the labels, the genes are sorted
according to the values of the statistic (2). The p-value
associated with $f_j$ is given by the frequency of the gene j in the
top g positions in the $s_1 \times s_2$ random permutations of the
labels.
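
A minimal sketch of this frequency assessment follows. It assumes labels coded as +1/-1, drawings that keep the class proportions of S (the text does not say whether the drawings are stratified), and synthetic data; `selection_frequencies` and its helpers are hypothetical names.

```python
import numpy as np

def s2n(X, y):
    """Statistic (2) for every gene (sample standard deviation assumed)."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    )

def top_genes(X, y, g):
    """First g/2 and last g/2 genes of the list sorted by (2)."""
    order = np.argsort(-s2n(X, y))
    half = g // 2
    return np.concatenate([order[:half], order[-half:]])

def draw_examples(y, n, rng):
    """Draw about n example indices, keeping the class proportions (assumption)."""
    idx = []
    for label in (1, -1):
        members = np.flatnonzero(y == label)
        k = int(round(n * len(members) / len(y)))
        idx.extend(rng.choice(members, size=k, replace=False))
    return np.array(idx)

def selection_frequencies(X, y, n, g, s1=10, s2=10, seed=0):
    """Return (f, p): f[j] is the selection frequency of gene j over s1 drawings,
    p[j] its frequency in the top g positions over s1*s2 label permutations."""
    rng = np.random.default_rng(seed)
    observed = np.zeros(X.shape[1])
    permuted = np.zeros(X.shape[1])
    for _ in range(s1):
        idx = draw_examples(y, n, rng)
        observed[top_genes(X[idx], y[idx], g)] += 1
        for _ in range(s2):
            y_perm = rng.permutation(y[idx])
            permuted[top_genes(X[idx], y_perm, g)] += 1
    return observed / s1, permuted / (s1 * s2)

# Synthetic data: 30 examples, 40 genes, gene 0 strongly informative.
rng = np.random.default_rng(2)
y = np.array([1] * 15 + [-1] * 15)
X = rng.normal(size=(30, 40))
X[y == 1, 0] += 8.0
f, p = selection_frequencies(X, y, n=20, g=4)
print(f"f_0 = {f[0]:.2f}, p-value of gene 0 = {p[0]:.2f}")
```

Each drawing contributes exactly g selected genes, so the frequencies sum to g; a gene is interesting when its observed frequency is high while its frequency under permuted labels is low.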

Testing 

In this section we try to answer the numerous questions 
previously raised, showing the results of the methods 
described as applied to our colon cancer data set. Irrespec- 
tive of the classifier adopted, the genes are appropriately 
normalized to have zero mean and unit variance. In particular,
for each training and test pair with n and $\ell$ - n examples
respectively, the n training examples are employed to
compute the mean and variance of each gene and these 
parameters are used to normalize the genes in both train- 
ing and test set. Moreover, linear kernels in RLS and SVM 
classifiers are used. 
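
The normalization step can be illustrated as follows. This is a sketch under assumptions: the guard against constant genes and the function name `standardize` are not from the text, and the data are synthetic.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale every gene with the mean and standard deviation of the
    training examples only, then apply the same transform to the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard for constant genes (assumption)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(35, 8))
X_test = rng.normal(loc=5.0, scale=3.0, size=(11, 8))
Z_train, Z_test = standardize(X_train, X_test)
```

After this transform the training genes have exactly zero mean and unit variance, while the test genes are only approximately standardized, because their statistics never enter the computation.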

Training set size 

The first question posed concerns the data set size. How 
many examples are sufficient for an accurate classification 
of microarray data of colon cancer? The answer depends, 
of course, on the classification model adopted. Table 1 
shows the error rate e and the p-value p of three classifica- 
tion schemes, obtained by varying the number of training 
examples. The error values were estimated performing $s_1$ =
500 cross validations and $s_2$ = 500 random permutations
of the labels. WVA reaches its minimum error rate of e =
19% with n = 35 examples, but this estimate has a poor
statistical significance (p > 5%). The best statistically
significant performance of
this model on our data set is reached with n = 25 training
examples, providing an error rate of e = 21% (p = 0.045).
This table shows that WVA has a limited learning ability,
because the error rate does not decrease significantly as
the number of training examples is increased (see fig. 1a).
Figure 1: Error rate of a) WVA, b) RLS and c) SVM classifiers varying
the training set size.



RLS and SVM classifiers show a different behavior. Both
methods provide classifiers with error rates of e ≤ 19% (p
< 5%) with only a few training examples, and their ability
to separate tumor from normal specimens improves as
the number of training examples increases. The best performances
of these classifiers are obtained with n = 35
examples. Moreover, the error rate does not improve by
further increasing the number of training examples, suggesting
that n = 35 is the optimal number of examples to use for
the training of accurate RLS or SVM classifiers (see fig. 1b
and 1c). The behavior of the statistical significance of the
three classifiers adopted as a function of the training set
size is shown in figure 2. As the picture shows, the LOO
error exhibits poor statistical significance. Such evidence,
reported in [12] as well, seems counter-intuitive given that
the LOO error is obtained by using the maximum
training set size. It becomes understandable once the
test set size is considered. In the LOO error procedure, the
test set is made up of a single example and the likelihood
that a random classifier can correctly classify the test
example by chance is high. The likelihood decreases as the
test set size increases. For the same number of
training examples, RLS and SVM classifiers show p-values
which are smaller than those of WVA for nearly all the
training set sizes in table 1. It should be noted that in all
the classification
schemes, the LOO error (last row in table 1), in spite of its
poor statistical significance, shows values which are comparable
to the ones of the LKOCV error when n is 30 or 35.
This suggests that the LOO error provides a good estimate
of the generalization error of a learning machine [11] and
that it can be used as a valid alternative to the LKOCV error to
compare the performances of different classification rules.
This aspect is relevant for RLS classifiers, which require just
one training for the evaluation of the LOO error [16].
Moreover, our results agree with the ones described in
[12], where it is shown that 10-20 examples suffice for the
training of classification rules with a statistically significant
error rate.
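
The effect of the test set size on the p-value can be illustrated with a small simulation. It is a simplified sketch, assuming that a classifier trained on permuted labels is right on each test example independently with probability 0.5; real permuted classifiers on a finite data set need not behave exactly like this, and `null_p_value` is a hypothetical name.

```python
import numpy as np

def null_p_value(e_obs, test_size, n_draws=100_000, seed=0):
    """Fraction of chance-level error rates, measured on test sets of the
    given size, that are smaller than the observed error rate e_obs."""
    rng = np.random.default_rng(seed)
    # Number of misclassified test examples for a coin-flip classifier.
    wrong = rng.binomial(test_size, 0.5, size=n_draws)
    return float(np.mean(wrong / test_size < e_obs))

p_loo = null_p_value(0.11, test_size=1)    # LOO: a single test example
p_lko = null_p_value(0.11, test_size=11)   # 11 test examples
print(f"test size 1: p = {p_loo:.3f}; test size 11: p = {p_lko:.3f}")
```

With a single test example every null error rate is either 0 or 1, so about half of them fall below any observed error between 0 and 1. Under this simplified model the p-value stays near 0.5 however accurate the real classifier is, which matches the order of magnitude of the last row of table 1.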

Number of genes 

The second question concerns the number of genes. How 
many genes are sufficient for an accurate classification of 
gene expression data of colon cancer? In order to be able 
to answer this question, we applied the method described 
in the section Algorithms. First of all, the number of genes
differentially expressed in our data set, i.e. the ones having
a statistically significant value of the statistic (2), had to
be determined. To do this, we evaluated (2) on the actual
data set and determined the number of genes having a
value of the statistic greater than a given threshold. The
curve denoted "observed" in figure 3 depicts the number
of genes as a function of the statistic $T_{S2N}$ in the actual
data set. Every point (x, y) of the curve represents the
number y of genes g such that $T_{S2N}(g) \geq x$. The same procedure
was applied on data sets with randomly permuted
class labels. Every point (x, y) of the curve denoted 1%
(5%) in figure 3 represents the number y of genes g having
$T_{S2N}(g) \geq x$ with p-value p ≤ 1% (5%). In this analysis we
carried out 1000 random permutations of the labels of the
whole data set. As shown in the picture (see the point
where observed and 5% curves intersect), about 6000
highly expressed genes (p < 5%) were found in the two
classes: 3000 genes more highly expressed in normal tissues
(figure 3a) and 3000 more highly expressed in tumor
tissues (figure 3b).

Figure 2: Estimated statistical significance for different training set sizes
using WVA, RLS and SVM classifiers.

Table 2 shows the error rate e and the p-value p of three dif- 
ferent classifiers, obtained by varying the number of the 
genes used. We used n = 25 examples for the training of 
WVA classifiers and n = 35 examples for those of RLS and 
SVM classifiers. We used $s_1 = s_2 = 500$ in this case as well.

Table 2: Error rate e and p-value p of classifiers trained with a
fixed number of examples and a different number of genes.

g     | WVA e | WVA p | RLS e | RLS p | SVM e | SVM p
22283 | 21%   | 0.045 | 14%   | 0.027 | 11%   | 0.019
16384 | 20%   | 0.065 | 14%   | 0.021 | 11%   | 0.025
8192  | 18%   | 0.073 | 14%   | 0.034 | 14%   | 0.039
4096  | 16%   | 0.116 | 14%   | 0.021 | 14%   | 0.039
2048  | 15%   | 0.168 | 14%   | 0.034 | 14%   | 0.033
1024  | 14%   | 0.216 | 13%   | 0.024 | 13%   | 0.040
512   | 13%   | 0.118 | 13%   | 0.028 | 14%   | 0.033
256   | 13%   | 0.127 | 13%   | 0.040 | 14%   | 0.025
128   | 13%   | 0.139 | 13%   | 0.036 | 14%   | 0.013
64    | 13%   | 0.142 | 13%   | 0.036 | 14%   | 0.022
32    | 13%   | 0.131 | 13%   | 0.022 | 14%   | 0.031
16    | 14%   | 0.242 | 13%   | 0.030 | 14%   | 0.040
8     | 15%   | 0.202 | 14%   | 0.029 | 14%   | 0.041
4     | 16%   | 0.165 | 14%   | 0.041 | 16%   | 0.031
2     | 19%   | 0.213 | 16%   | 0.046 | 16%   | 0.041



It should be noted that WVA always provides error rates
with a poor statistical significance, except when the whole
set of genes is used. Moreover, the behavior of e as a function
of g shows that this classification model is highly sensitive
to the noise embedded in the gene expression data.
In fact, when the less informative genes are discarded
from the classification process, the error rate improves
significantly, down to 13% with only 32 genes. On the contrary,
RLS classifiers show good statistical significance and
low sensitivity to the noise, because the error rate remains
essentially unchanged over the whole range of values of g.
Nevertheless, they are not able to exploit the information
embedded in the less informative genes as fully as SVM
does. When the whole set of genes is employed, the error
rates of RLS and SVM are e = 14% (p = 0.027) and e = 11%
(p = 0.019) respectively, and the errors do not change
when 74% of the genes (g = 16384) are used. The error rates
of the two machines become comparable only when 37%
of the genes (g = 8192) are used. These results point out that
SVM is not influenced by the noise embedded in the data
and, most of all, that it is able to exploit the subtle difference
between normal and tumor specimens hidden in the
less informative genes. Moreover, the results described
above show that several cell products are altered in colon
cancer and that an accurate classification is possible only
by taking into account the expression levels of thousands
of genes simultaneously.

Frequency analysis of the genes selected 

In order to analyze the frequency of appearance $f_j$ of the
gene j = 1, 2, ..., d in the lists of the g genes selected in the
cross validation procedure, $s_1$ = 100 random drawings of n
= 35 examples from the data set S were carried out; for
each drawing, the genes were sorted according to the value
of the statistic (2). The frequency $f_j$ was evaluated by
counting the presence of the gene j in the top g = 2048
positions (the first 1024 and the last 1024) in the lists of
the sorted genes. Figure 4a) depicts the frequencies of all
the genes available. It can be seen that more than half of
the genes do not appear in the top g positions of the list.
Moreover, 1078 genes were found (467 more highly
expressed in normal specimens and 611 in tumor ones) to
have a frequency greater than 80% (see figure 4b) and,
among these, 516 had a frequency of 100%. Aiming to
assess the statistical significance of these frequencies, we
performed $s_2$ = 100 random permutations of the labels of
the n examples in each random drawing. Figure 4c)
depicts the number of genes with $f_j \geq 80\%$ having
a given p-value. Thanks to this analysis, 647 statistically
significant genes (p < 0.05) were found.

Biological analysis 

Among the statistically significant genes, 92 genes differentially
expressed between normal tissue and matched
tumour tissue are reported in tables 3 and 4. Most genes
Figure 3: Number of genes more highly expressed in a) normal and b)
tumor tissues determined in the actual data set (observed
curve) and in data sets with randomly permuted class labels
(1% and 5% curves) for different values of the $T_{S2N}$ statistic.



have already been shown to be involved in colorectal tumorigenesis.
A brief description of 45 genes up-regulated and 47
genes down-regulated in tumour tissue, which could be
used as diagnostic biomarkers or targets for therapy, is
reported. At least 31 cell cycle genes were found
to be up-regulated in our colon cancer specimens. The
mitotic checkpoint is an important signalling cascade that
arrests the cell cycle in mitosis when even a single chromosome
is not properly attached to the mitotic spindle
[20]. It has been postulated that defects in the levels of
mitotic checkpoint proteins could be responsible for
mitotic checkpoint impairment and aneuploidy with disruption
of genomic integrity. However, until now, no
functionally significant sequence variations of mitotic
checkpoint genes have been detected in colorectal cancer
[21]. Conversely, we found that 6 genes involved in the
mitotic spindle checkpoint (TTK, BUB1, BUB3, CDC20,
MAD2L1, and BUB1B) are overexpressed in colon cancer
specimens. Very recently, an increased expression of
mitotic spindle checkpoint transcripts has been reported
in breast cancers with chromosomal instability [22], suggesting
that mitotic checkpoint impairment in human
tumor cells (and chromosomal instability) could be due
to increased levels of mitotic checkpoint proteins rather
than mutations in checkpoint genes. In tumours, these
changes could occur through altered transcriptional regulation
by tumour suppressors or oncogene products.
Drugs that specifically and efficiently interfere with
mitotic checkpoint signalling could therefore be useful as
anticancer agents. Another process which is deeply disorganized
in cancer is cell growth, with several cellular processes
and mechanisms that control cell cycle progression
being deregulated. In non-neoplastic cells, these events are
highly conserved due to the existence of conservatory
mechanisms and molecules such as cell cycle genes and
their products: cyclins, cyclin-dependent kinases, Cdk
inhibitors (CKI) and extracellular factors (e.g. growth factors).
At least 25 genes of cell cycle progression were found
to be up-regulated in our colon cancer specimens.
They include CDC2, the universal inducer of mitosis, cyclin
B and CDC25, which interact with CDC2 to regulate
both G1/S and G2/M transitions (checkpoints) of the
cell cycle, and the MCM genes, which are required for
entry into S phase and for genome duplication.

Four up-regulated genes involved in cell cycle progression
are of particular interest in colon tumorigenesis:
CKS1, CKS2, SKP2, and FOXM1. Both CKS1 and SKP2 are
involved in regulation of the G1/S transition and in degradation
of CDKN1B (p27), a putative suppressor gene. Colorectal
tumours with high levels of CKS1 and SKP2
generally exhibit a more aggressive behaviour and are
associated with low levels of CDKN1B (p27) and loss of
tumor differentiation [23]. Moreover, CKS2 is expressed
at significantly higher levels in colorectal tumors with
liver metastasis [24]. Apart from their prognostic significance,
these genes could also represent optimal targets for
gene therapy. Recently, the effect of transfection of Cks1-specific
small interfering RNA (siRNA) in human Cks1-overexpressing
H358 lung cancer cell lines has been
tested: Cks1 siRNA down-regulated Cdc2 kinase activity
and induced G2/M arrest. Long-term treatment with Cks1
siRNA induced caspase activation and apoptosis [25]. The
FOXM1 gene is critical for the G1/S transition and essential
for transcription of cell cycle genes such as SKP2 and CKS1
[26]. Seven other up-regulated genes involved in cell mitosis
are STK15, SRPK1, TOP2A, SMC4L1, CNAP1,
HCAP-G, and KIF4A. All of them have been found overexpressed
in some cancer lines and some tumour cells and
may represent both prognostic indicators and molecular
Figure 4: Frequency analysis of the genes selected. a) Frequencies of all
the genes in the top g = 2048 positions in the sorted gene
list. The frequencies of the highly expressed genes in normal
and tumor specimens are indicated with HN and HT respectively.
b) Number of genes with frequency ≥ 80% and c) the
number of genes with a given p-value.



targets for anticancer drugs. STK15 is a critical centrosome-associated
kinase-encoding gene overexpressed in multiple
human tumour cell types which is involved in the
induction of centrosome duplication-distribution abnormalities,
chromosomal instability, and aneuploidy in
mammalian cells [27]. It could represent an optimal target
for chemotherapy. SRPK1 and TOP2A are part of a
multisubunit complex, named toposome, containing 
ATPase/helicase proteins (RNA helicase A and RHII/Gu), 
HMG protein (SSRP1), and pre-mRNA splicing factors 
(PRP8 and hnRNP C) which is involved in separating 
entangled circular chromatin DNA during chromosome 
segregation. In particular, SRPK1 plays a central role in the 
pre-mRNA splicing, a critical step in the posttranscrip- 
tional regulation of gene expression. Aberrant patterns of 
pre-mRNA splicing have been established for many 
human malignancies. Recently, it has been shown that 
SRPK1 is overexpressed in tumors of the pancreas, breast, 
and colon and siRNA-mediated down-regulation of 
SRPK1 in tumour cell lines results in a dose-dependent 
decrease in proliferative capacity and increase in apoptotic 
potential [28]. These findings support SRPK1 as a new, 
potential target for the treatment of cancer. 

Finally, SMC4L1, CNAP1, and HCAP-G are components
of the condensin complex, which also contains four other
subunits: SMC2L1, BRRN1, CAPH, and CAPD2 [29].
KIF4A is proposed to be a motor protein carrying DNA as
cargo in condensed chromosomes throughout mitosis,
interacting with the condensin complex [30]. The condensin
complex is required for conversion of interphase chromatin
into mitotic-like condensed chromosomes. Interestingly,
CDC2, the universal inducer of mitosis,
phosphorylates HCAP-G, CNAP1, and BRRN1, thus activating
the condensin complex and chromosome condensation.
Among the up-regulated genes in colorectal cancer,
we found 14 genes involved in signal transduction
(TDGF1 and ENC1), transcription (SOX9, MYC, and
HGFR/MET), nuclear transport (NUP62, NUPL1,
NUP155, KPNA2, RANBP5, CSE1L/CAS, NTF2, and
RANBP1) and cellular transport (SLCO4A1). TDGF1, a
growth factor with an EGF-like domain, is over-expressed
in breast, cervical, ovarian, gastric, lung, colon, and pancreatic
carcinomas, in contrast to normal tissues where
TDGF1 expression is invariably low or absent. TDGF1 is
released or shed from expressing cells and may serve as an
accessible marker gene in the early to mid-progressive
stages of breast and other cancers [31]. ENC1 is another
transduction gene probably involved in differentiation of
epithelial cells as well as in cell proliferation. ENC1 is regulated
by the beta-catenin/Tcf pathway and up-regulated
in colorectal cancer, where it may suppress differentiation
of colonic cells [32]. SOX9 is a transcription factor and
seems to be expressed throughout the intestinal epithelium
under the control of the Wnt pathway. Its function
Table 3: 45 genes up-regulated in tumoral tissue, comparing normal mucosa to matched tumor colon tissue.

Function | Gene | OMIM | Accession no. | p-value | Gene description
Cell cycle: mitosis (spindle checkpoint) | TTK | 604092 | NM_003318.1 | 0.029 | Threonine-tyrosine kinase
 | BUB1 | 602452 | AF043294.2 | 0.035 | Budding uninhibited by benzimidazoles 1 homolog (yeast)
 | BUB3 | 603719 | NM_004725.1 | 0.037 | Budding uninhibited by benzimidazoles 3 homolog (yeast)
 | CDC20 | 603618 | NM_001255.1 | 0.044 | Cell division cycle 20
 | MAD2L1 | 602686 | NM_002358.2 | 0.049 | MAD2 (mitotic arrest deficient, yeast, homolog)-like 1
 | BUB1B | 602860 | NM_001211.2 | 0.050 | Budding uninhibited by benzimidazoles 1 homolog beta (yeast)
Cell cycle: G0/G1 transition | INSIG1 | 602055 | NM_005542.1 | 0.039 | Insulin induced gene 1 (cell division cycle, G0 to G1)
Cell cycle: mitosis (G1/S checkpoint) | CKS2 | 116901 | NM_001827.1 | 0.047 | CDC28 protein kinase regulatory subunit 2
 | CKS1B | 116900 | NM_001826.1 | 0.046 | CDC28 protein kinase regulatory subunit 1B
 | SKP2 | 601436 | BG105365 | 0.050 | S-phase kinase-associated protein 2 (p45)
 | FOXM1 | 602341 | NM_021953.1 | 0.045 | Forkhead box M1
 | MCM4 | 602638 | AA604621 | 0.036 | Minichromosome maintenance deficient (S. cerevisiae) 4
 | MCM3 | 602693 | NM_002388.2 | 0.048 | Minichromosome maintenance deficient (S. cerevisiae) 3
 | MCM7 | 600592 | D55716.1 | 0.048 | Minichromosome maintenance deficient 7 (S. cerevisiae)
 | MCM2 | 116945 | NM_004526.1 | 0.049 | Minichromosome maintenance deficient (S. cerevisiae) 2
 | MCM6 | 601806 | NM_005915.2 | 0.050 | Minichromosome maintenance deficient (S. pombe) 6
Cell cycle: mitosis (G1/S and G2/M checkpoints) | CRKRS | | M68520.1 | 0.039 | Cdc2-related kinase, arginine/serine-rich
 | CDC2/CDK1 | 116940 | NM_001786.1 | 0.044 | Cell division cycle 2, G1 to S and G2 to M
 | CDC25A | 116947 | NM_001789.1 | 0.050 | Cell division cycle 25A
 | CDC25B | 116949 | NM_021873.1 | 0.050 | Cell division cycle 25B
 | CCNA2 | 123835 | NM_001237.1 | 0.050 | Cyclin A2
Cell cycle: mitosis (G2/M checkpoint) | CCNB1 | 123836 | Hs.23960 | 0.047 | Cyclin B1 (cell division cycle, G2 to M)
 | CCNB2 | 602755 | NM_004701.2 | 0.047 | Cyclin B2 (cell division cycle, G2 to M)
 | NEK2 | 604043 | NM_002497.1 | 0.037 | NIMA (never in mitosis gene a)-related kinase 2
Cell cycle: mitosis | STK15 | 602687 | NM_003600.1 | 0.039 | Serine/threonine kinase 6 (chr segregation)
 | SRPK1 | 601939 | NM_003137.1 | 0.046 | SFRS protein kinase 1 (chr segregation)
 | TOP2A | 126430 | NM_001067.1 | 0.050 | Topoisomerase (DNA) II alpha (170 kD) (chr segregation)
 | KIF4A | 300521 | NM_012310.2 | 0.035 | Kinesin family member 4A (spindle formation/chr condensation)
 | CNAP1 | 609689 | NM_014865 | 0.046 | Chromosome condensation-related SMC-associated protein 1
 | SMC4L1 | | NM_005496.1 | 0.048 | SMC4 structural maintenance of chromosomes 4-like 1 (yeast)
 | HCAP-G | 606280 | NM_022346.1 | 0.042 | Chromosome condensation protein G (chr condensation)
Signal transduction | TDGF1 | 187395 | NM_003212.1 | 0.048 | Teratocarcinoma-derived growth factor 1 (EGF signaling)
 | ENC1 | 605173 | NM_003633.1 | 0.048 | PIG10, ectodermal-neural cortex (WNT/beta-catenin pathway)
Transcription | SOX9 | 608160 | NM_000346.1 | 0.045 | Sex determining region Y-box 9
 | MYC | 190080 | NM_002467.1 | 0.047 | V-myc avian myelocytomatosis viral oncogene homolog
 | HGFR/MET | 164860 | NM_002467.1 | 0.047 | Met proto-oncogene
Transport: intracellular | NUP62 | 605815 | NM_012346.1 | 0.039 | Nucleoporin 62 kD
 | NUPL1 | 607615 | NM_007342.1 | 0.050 | Nucleoporin-like 1
 | NUP155 | 606694 | NM_004298.1 | 0.045 | Nucleoporin 155 kD (NUP155)
 | KPNA2 | 600685 | NM_002266.1 | 0.045 | Karyopherin alpha 2 (RAG cohort 1, importin alpha 1)
 | RANBP5 | 602008 | NM_002271.1 | 0.050 | RAN binding protein 5 or karyopherin (importin) beta 3
 | CSE1L/CAS | 601342 | NM_001316 | 0.050 | CSE1 chromosome segregation 1-like (yeast)
 | NXT1 | 605811 | NM_005796.1 | 0.050 | Nuclear transport factor 2 (NTF2)
 | RANBP1 | 601180 | NM_002882.2 | 0.048 | RAN binding protein 1
Transport | SLCO4A1 | 605495 | NM_016354.1 | 0.048 | Solute carrier family 21 (organic anion transporter)



may be to maintain healthy and tumor epithelial cells in an
undifferentiated state [33]. MYC and HGFR/MET are two
well-known oncogenes which activate the transcription of
growth-related genes. Overexpression of MYC and HGFR/MET
is implicated in the aetiology of a variety of tumours
and could serve as an important therapeutic target. Eight
genes involved in nucleocytoplasmic transport were up-regulated
in colon cancer. Nuclear-cytoplasmic transport,
which occurs through special structures called nuclear
pores, is an important aspect of normal cell function, and
defects in this process have been detected in many different
types of cancer cells.
Table 4: 47 genes down-regulated in tumoral tissue, comparing normal mucosa to matched tumor colon tissue.

Function | Gene | OMIM | Accession no. | p-value | Gene description
Apoptosis | PDCD4 | 608610 | NM_014456.1 | 0.032 | Programmed cell death 4 (neoplastic transformation inhibitor)
 | FAS | 604306 | NM_000043.1 | 0.044 | Fas (TNF receptor superfamily, member 6)
 | CASP7 | 601761 | NM_001227.1 | 0.050 | Caspase 7, apoptosis-related cysteine protease
Transport | SLC30A10 | | NM_018713.1 | 0.036 | Solute carrier family 30, member 10 (zinc transport?)
 | SLC9A2 | 600530 | AF073299.1 | 0.041 | Solute carrier family 9 (sodium/hydrogen exchanger), member 2
 | SLC4A4 | 603345 | AF069510.1 | 0.041 | Solute carrier family 4, sodium bicarbonate cotransporter, member 4
 | SLC26A3 | 126650 | NM_000111.1 | 0.044 | Solute carrier family 26, member 3
 | SLC26A2 | 606718 | AI025519 | 0.044 | Solute carrier family 26 (sulfate transporter), member 2
 | SGK2 | 607589 | NM_016276.1 | 0.038 | Serum glucocorticoid regul. kinase 2 (potassium channel activation)
 | KIF5C | 604593 | NM_004522.1 | 0.040 | Kinesin family member 5C (intracellular transport)
 | KIF13B | 607350 | NM_015254.1 | 0.046 | Kinesin family member 13B (intracellular transport)
 | VAPA | 605703 | AF154847.1 | 0.047 | VAMP (vesicle-associated membrane protein)-assoc. protein A, 33 kDa
Signalling | MAP2K4 | 601335 | NM_022129.1 | 0.033 | Mitogen-activated protein kinase kinase 4 (MAPK signaling pathway)
 | RPS6KA5 | 603608 | AF074393.1 | 0.040 | Ribos. prot. S6 kinase, 90 kDa, polyp. 5 (MAPK signalling pathway)
 | MEF2C | 600662 | L08895.1 | 0.033 | MADS box transcr. enhancer factor 2 (MAPK signalling pathway)
 | PPP2R3A | 604944 | NM_002718.1 | 0.037 | Protein phosphatase 2, regulatory subunit B, alpha (Wnt signalling)
 | PDE9A | 602973 | NM_002606.1 | 0.040 | Phosphodiesterase 9A (signal transduction)
 | PPAP2A | 607124 | AF014403.1 | 0.042 | Phosphatidic acid phosphatase type 2A (signal transduction)
 | MUC4 | 158372 | AJ242547.1 | 0.044 | Mucin 4 (Erb2 signalling pathway)
 | DSCR1 | 602917 | AL049369.1 | 0.045 | Down syndrome critical region gene 1 (signal transduction)
 | SHOC2 | 602775 | NM_007373.1 | 0.046 | Soc-2 suppressor of clear homolog (MAPK signaling pathway)
 | SOCS2 | 605117 | NM_003877.1 | 0.049 | Suppressor of cytokine signaling 2 (GH/IGF1 signaling pathway)
 | SMAD2 | 601366 | NM_005901.1 | 0.049 | SMAD, homolog 2 (Drosophila) (TGF-beta signaling)
Cell-surface signalling | TSPAN7 | 300096 | NM_004615.1 | 0.036 | Tetraspanin 7
 | EDG2 | 602282 | NM_001401.1 | 0.041 | Lysophosphatidic acid G-protein-coupled receptor, 2
 | TMPRSS2 | 602060 | AF270487.1 | 0.046 | Transmembrane protease, serine 2
 | CEACAM7 | | NM_006890.1 | 0.047 | Carcinoembryonic antigen-related cell adhesion molecule 7
Cell adhesion | DSC2 | 125645 | NM_004949.1 | 0.045 | Desmocollin 2
Cell differentiation | NDRG2 | 605272 | NM_016250.1 | 0.038 | NDRG family member 2
 | EPB41L3 | 605331 | NM_012307.1 | 0.044 | Erythrocyte membrane protein band 4.1-like 3 (suppressor gene?)
 | MTUS1 | 609589 | NM_024307.1 | 0.045 | Mitochondrial tumor suppressor 1
Metabolism | HMGCL | 246450 | NM_000191.1 | 0.040 | 3-hydroxymethyl-3-methylglutaryl-Coenzyme A lyase
 | UGDH | 603370 | NM_003359.1 | 0.041 | UDP-glucose dehydrogenase




CAI2 


603263 


NM 001218.2 


0.044 


Carbonic anhydrase XII 




CA2 


259730 


NM 000067.1 


0.049 


Carbonic anhydrase II 




CA4 


1 14760 


NM 000717.2 


0.050 


Carbonic anhydrase IV 




CAI 


1 14800 


NM 001738.1 


0.050 


Carbonic anhydrase 1 




CA7 


1 14770 


NM 005182.1 


0.050 


Carbonic anhydrase VII 




HPGD 


601688 


U63296.I 


0.046 


Hydroxyprostaglandin dehydrogenase I5-(NAD) 




FUCA1 


230000 


NM 000147.1 


0.047 


Fucosidase, alpha-L-l, tissue 




ACATI 


607809 


NM 000019.1 


0.048 


Acetyl-Coenzyme A acetyl transferase 1 




ADHIC 


103730 


NM 000669.2 


0.048 


Alcohol dehydrogenase3 (class 1), gamma polypeptide 




AQP8 


603750 


NM 001 169.1 


0.050 


Aquaporin 8 


Cell growth 


FAMI07A 


608295 


NM 007177.1 


0,040 


Family with sequence similarity 107, member A (TU3A) 




EMPI 


602333 


NM 001423.1 


0.047 


Epithelial membrane protein 1 (growth arrest) 




BTGI 


109580 


NM 00173 I.I 


0.050 


B-cell o-anslocation gene 1 . anti-proliferative 




KLF4 


602253 


NM 004235.1 


0.050 


Kruppel-like factor 4 (gut) 



Overproduction of nuclear transport factors such as
KPNA2, RANBP5, NTF2, and CSE1L/CAS may disrupt the
nuclear import and export machinery, leading to loss of
nuclear transport of several proliferation-activating
proteins, transcription factors, oncogene and tumour
suppressor gene products and, finally, to cell transformation
[34]. One up-regulated gene with transport function has
been detected: SLCO4A1/OATP1, which belongs to a superfamily
of membrane transport systems expressed in the liver,
kidney, small intestine, and choroid plexus barrier. It
mediates sodium-independent transmembrane solute
transport and has a strategic position for absorption,
distribution and excretion of xenobiotic substances [35].
At least 3 genes involved in apoptosis have been shown to
be down-regulated in our colon cancer specimens. FAS and
CASP7 are involved in the activation cascade of caspases
responsible for apoptosis. Both could be involved in
tumour progression and poorer prognosis, as shown in
urothelial cancer [36]. PDCD4 is a well-known tumour
suppressor gene involved in apoptosis and inhibition of
protein translation. Loss of PDCD4 is associated with
tumour progression and prognosis [37], while
overexpression of PDCD4 in human colon carcinoma cells is
able to suppress tumour progression by inhibiting c-Jun
and AP-1 pathways [38]. These findings point to a
potential value of PDCD4 as a molecular target in cancer
therapy. Molecular transport and cell metabolism are
strongly impaired in cancer cells. Consequently, it is
not surprising that microarray analysis revealed
down-regulation of several genes coding for proteins of
transport and metabolism. Loss of carriers profoundly
affects the intracellular concentration of solutes such as
sodium, potassium, hydrogen, and bicarbonate, which are
involved in several metabolic pathways. Loss of enzymes
which control the most important metabolic pathways
has a negative influence on cell physiology and, most
importantly, might render cancer cells less sensitive or
resistant to anticancer drugs.

Of relevance is the down-regulation of most carbonic
anhydrases, which control pH homeostasis and modulate
the behaviour of cancer cells. In our specimens, several
isozymes of carbonic anhydrase (I, II, IV, VII, and XII)
were down-regulated, implying a pathogenic role in cancer
development or progression. Several genes coding for
proteins involved in intracellular and cell surface
signalling pathways were down-regulated in colon cancer.
In our analysis, down-regulation of genes such as MAP2K4,
RPS6KA5, MEF2C, and SHOC2 produces a serious impairment
of the MAPK signalling cascade involved in cell
growth and differentiation. Similarly, other
down-regulated genes such as PPP2R3A, MUC4, SOCS2 and SMAD2
may contribute to impairing the Wnt, Erb2, GH, and TGF-beta
pathways involved in several cellular processes. NDRG2,
EPB41L3, and MTUS1 are three down-regulated genes
implicated in cell differentiation. They represent three
candidate tumour suppressor genes and are often inactivated
in tumours [39,41]. Their relevance in colon cancer
progression and prognosis is still to be determined. Three
other down-regulated genes implicated in negative control of
cell growth have been identified by microarray analysis:
FAM107A (TU3A), BTG1, and KLF4. TU3A has also been
found down-regulated in renal cancer cells [42]; even
if its molecular function is unknown, it could represent a
novel suppressor gene. BTG1 is an antiproliferative
protein involved in apoptosis. Its role in colonic
carcinogenesis is still to be elucidated. Finally, KLF4, an
inhibitor of the cell cycle, has recently been found
down-regulated in colonic [43] and gastric cancer. Loss of
expression of KLF4 is associated with cancer progression [44].



Discussion and conclusions 

The present paper describes a general methodology for the
assessment of the statistical significance of prediction
rules trained to classify DNA microarray data. The
method, which can be considered a natural extension of
the ones proposed in [12,13], provides statistically
significant answers to precise questions relevant to the
diagnosis and prognosis of cancer. The method has been
applied to a new DNA microarray data set collected in Casa
Sollievo della Sofferenza Hospital, Foggia - Italy, relative
to patients affected by colon cancer. We have found that it
is possible to train statistically significant classifiers for
colon cancer diagnosis with as few as 15 examples. This
result agrees with the one described in [12] and it bears
out the empirical observation that tumor morphological
distinctions (including disease versus normal
classification) are, in general, easier to deal with than
those concerning the treatment outcome prediction. In our
case, the best classification performance was achieved by
training an SVM classifier with 35 examples, which produced
an error rate of e = 11% (p = 0.019). This shows that the size
of our data set is sufficient to build statistically significant
classifiers for colon cancer diagnosis.
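
The p-values quoted with each error rate come from a permutation test. As a minimal sketch of the idea, the function below estimates a p-value from an observed error rate and the error rates of classifiers trained on label-permuted data. The function name, the add-one correction and the toy numbers are illustrative assumptions and are not taken from the procedure used in this study.

```python
import numpy as np

def permutation_p_value(observed_error, permuted_errors):
    """Share of label-permuted classifiers whose error is no worse than
    the observed one (add-one correction, an assumption, keeps p > 0)."""
    permuted_errors = np.asarray(permuted_errors, dtype=float)
    n_as_good = np.sum(permuted_errors <= observed_error)
    return (n_as_good + 1) / (permuted_errors.size + 1)

# Toy example with made-up error rates of 9 random classifiers
perm_errors = [0.45, 0.52, 0.10, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49]
p = permutation_p_value(0.11, perm_errors)
```

In practice the list of permuted errors would hold many more entries, since the smallest attainable p-value is bounded by one over the number of permutations plus one.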

Concerning the problem of determining a sufficient
number of genes to be used for an accurate classification
of colon cancer, our results suggest that it depends on the
accuracy required. In fact, the error rate ranges between e
= 11% (p = 0.025), obtained training SVM classifiers with
g = 16384 genes, and e = 16% (p < 0.05), obtained training
RLS or SVM classifiers with only g = 2 genes. This result
indicates that a remarkable number of genes are altered in
the pathology and that many of them convey useful
information for the classification of new specimens. In
order to verify such a result, the following experiment was
carried out. We trained an SVM classifier with 35 examples,
each composed of 64 genes randomly drawn from the
set of all the genes available, thus obtaining an error rate
of e = 23% (p = 0.038). This value, although higher than
the one obtained by using gene lists ranked with the
statistic (see table 2), indicates that many different sets of
64 genes can be used to build accurate classifiers. The
behavior of e as a function of g is consistent and has been
pointed out by other authors. For example, [45] finds a
decreasing behavior of the error rate w.r.t. g by analyzing
three microarray data sets, with different gene selection
criteria. In conclusion, our results indicate that a highly
accurate and statistically significant classification of colon
specimens is possible even when a small number of genes
is employed.
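
The random-gene experiment described above can be sketched as follows. This is only an illustration under stated assumptions: a nearest-centroid rule stands in for the SVM classifier, the data are synthetic, and all function names are hypothetical. Each random split draws a fresh set of genes, trains on part of the specimens and measures the error on the rest.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the closer class centroid (labels +1/-1)."""
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)
    d_pos = np.sum((X_test - mu_pos) ** 2, axis=1)
    d_neg = np.sum((X_test - mu_neg) ** 2, axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def random_gene_subset_error(X, y, n_genes, n_train, n_splits, rng):
    """Mean test error over random splits, each with a new random gene subset."""
    errors = []
    for _ in range(n_splits):
        genes = rng.choice(X.shape[1], size=n_genes, replace=False)
        order = rng.permutation(X.shape[0])
        train, test = order[:n_train], order[n_train:]
        pred = nearest_centroid_predict(X[np.ix_(train, genes)], y[train],
                                        X[np.ix_(test, genes)])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))

# Synthetic stand-in data: 40 specimens, 500 genes, every gene weakly informative
rng = np.random.default_rng(0)
y = np.array([1] * 20 + [-1] * 20)
X = rng.normal(size=(40, 500)) + 0.5 * y[:, None]
err = random_gene_subset_error(X, y, n_genes=64, n_train=35, n_splits=50, rng=rng)
```

When many genes carry some signal, as in this toy data, almost any random subset of 64 genes separates the classes, which is the behaviour the experiment was designed to check.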

Some conclusions can be drawn concerning the
classification models involved in our analysis. WVA
classifiers show poor generalization ability and they are
greatly influenced by the noise embedded in the microarray
data. They rarely provide statistically significant
classification performances and, for these reasons, they
should not be used as predictors of DNA microarray data. On
the contrary, the performances of RLS classifiers are
comparable to those of SVM classifiers, the state-of-the-art
supervised learning machines in many application domains,
including cancer classification by DNA microarray data [5].
The main advantage of RLS machines in solving a
classification problem lies in their employment of a linear
system of order equal to either the number of genes or the
number of training examples. This property is extremely
important and reduces the computational cost of the
permutation test because, for a fixed random split of the
data, the coefficients of random classifiers are obtained by
multiplying a constant matrix with vectors of randomly
permuted labels [16]. Moreover, RLS machines allow us to
get an exact measure of the LOO error with just one
training. For all these reasons and because of their
simplicity and low computational complexity, RLS classifiers
provide a valuable alternative to SVM classifiers with regard
to the problem of cancer classification by gene expression
data. Moreover, RLS classifiers show generalization
abilities comparable to the ones of SVM classifiers even when
the classification of new specimens involves very few gene
expression levels. The last consideration concerns the way
in which these two classification schemes represent the
solution. SVM tends to give sparse solutions in terms of
number of training examples and RLS tends to give sparse
solutions in terms of number of features used for
classifying.
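
The two computational properties of RLS mentioned above can be illustrated with a short sketch. It assumes the dual form with a linear kernel K = X X^T and coefficients c = (K + lam I)^{-1} y; the regularization convention, the function names and the synthetic data are assumptions made for illustration, not details taken from this paper. The inverse matrix is computed once, coefficients for any number of label permutations follow from a single matrix product, and the exact leave-one-out outputs follow from the standard identity f_{-i}(x_i) = y_i - c_i / (G^{-1})_{ii}.

```python
import numpy as np

def rls_inverse(K, lam):
    """Return G^{-1}, where G = K + lam * I is the regularized kernel matrix."""
    return np.linalg.inv(K + lam * np.eye(K.shape[0]))

def rls_permuted_coefficients(G_inv, y, n_perm, rng):
    """Coefficients of n_perm classifiers trained on randomly permuted labels.

    G_inv depends only on the data split, so each random classifier costs
    one matrix-vector product instead of a new training."""
    Y_perm = np.column_stack([rng.permutation(y) for _ in range(n_perm)])
    return G_inv @ Y_perm, Y_perm

def rls_loo_outputs(G_inv, y):
    """Exact leave-one-out outputs f_{-i}(x_i) from a single training."""
    c = G_inv @ y
    return y - c / np.diag(G_inv)

# Synthetic stand-in data: 12 specimens, 30 genes, labels +1/-1
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 30))
y = np.where(rng.normal(size=12) > 0, 1.0, -1.0)
K = X @ X.T
G_inv = rls_inverse(K, lam=0.5)
C_perm, Y_perm = rls_permuted_coefficients(G_inv, y, n_perm=100, rng=rng)
loo_error = np.mean(np.sign(rls_loo_outputs(G_inv, y)) != y)
```

For larger problems a factorization of G reused across right-hand sides would usually be preferred to an explicit inverse; the explicit inverse is kept here because its diagonal is needed for the leave-one-out shortcut.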

Colorectal cancer is the third most common cancer in 
men and women and accounts for 11% of all cancer 
deaths. Whereas the 5-year survival rate is extremely favo- 
rable when detected at a localized stage (90%), most 
colorectal cancers are either locally or distantly invasive at 
diagnosis, limiting treatment options and lowering sur- 
vival rates. Clearly, a more comprehensive view of the 
molecular events associated with colorectal tumorigenesis 
is needed to identify tumours earlier and to treat colorec- 
tal tumours more effectively. Microarray technology has 
the potential to detect tumour-specific genes which can be 
used as biomarkers for early diagnosis and specific treat- 
ments. Potential uses of this technology include determin- 
ing who will benefit from chemotherapy, further 
classifying patients into responders and nonresponders, 
predicting apoptotic response, developing classifiers to 
recognize chemosensitive tumors, identifying genes that 
portend a poor prognosis, revealing genes associated with 
metastases, predicting the outcome according to clinical 
stage, and avoiding surgery in patients who would not 
benefit from resection. 

In this study, by means of specific statistical methods, we
have found several genes up- and down-regulated in
colon cancer which could be used as diagnostic biomarkers
or therapeutic targets. Among the up-regulated genes,
the most representative are those implicated in the mitotic
checkpoint signalling cascade and those controlling cell
cycle progression. Inhibition of overexpressed genes is
potentially useful to control cancer growth. Among the
down-regulated genes, the most interesting for their
potential therapeutic implication are those of apoptosis,
intracellular and cell surface signalling, and cell arrest.
Reactivation of their function could be useful to suppress
cancer development or progression. A few of these up-
and down-regulated genes have not been described in
colon cancer yet. Further studies focused on these genes
and related transcripts are necessary to better elucidate
their pathogenic role in colon cancer and their
clinical relevance in diagnostics and therapeutics.